
Latest articles from IEEE/ACM Transactions on Audio, Speech, and Language Processing

Spherically Steerable Vector Differential Microphone Arrays
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-10 | DOI: 10.1109/TASLP.2024.3458799
Hüseyin Hacıhabiboğlu
Differential microphone arrays (DMAs) use multiple omnidirectional microphones for synthesising higher-order microphone directivity patterns. In their most basic form, they can be used to obtain fixed-directivity or horizontally steerable beamformers that can satisfy certain constraints. We propose a vector differential microphone array (VDMA) which is frequency- and direction-invariantly steerable in three dimensions. The proposed design comprises pressure and particle velocity sensors positioned on a circular constellation in a plane and allows extracting the third-order spherical harmonic decomposition of the sound field. This decomposition can then be used to obtain spherically direction-invariant steered beams. Synthesis of a maximum directivity factor (MaxDF) directivity pattern is demonstrated. A closed-form expression for the proposed array's white noise gain (WNG) is derived. The robustness of the proposed design to noise is analysed.
Vol. 32, pp. 4342-4354.
Citations: 0
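The abstract above uses white noise gain (WNG) as its robustness measure. The paper derives a closed-form WNG expression for the proposed VDMA; as a loose illustration only, the sketch below evaluates the generic definition WNG = |w^H d|^2 / (w^H w) for a hypothetical 4-element circular array with delay-and-sum weights, not the proposed design.

```python
import numpy as np

def white_noise_gain_db(w, d):
    """Generic white noise gain: WNG = |w^H d|^2 / (w^H w), in dB."""
    num = np.abs(np.vdot(w, d)) ** 2
    den = np.real(np.vdot(w, w))
    return 10 * np.log10(num / den)

# Hypothetical 4-element circular array, radius 2 cm, 1 kHz plane wave.
c, f, r, M = 343.0, 1000.0, 0.02, 4
phi = 2 * np.pi * np.arange(M) / M                 # sensor angles on the circle
theta = 0.0                                        # look direction (radians)
k = 2 * np.pi * f / c                              # wavenumber
d = np.exp(1j * k * r * np.cos(theta - phi))       # plane-wave steering vector
w = d / M                                          # delay-and-sum weights (reference case)
print(f"WNG: {white_noise_gain_db(w, d):.2f} dB")  # 10*log10(4) ≈ 6.02 dB
```

For delay-and-sum weights this returns 10·log10(M) ≈ 6 dB, the maximum achievable WNG for four sensors; differential designs typically trade some of this robustness for higher directivity.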
Self-Supervised Learning of Spatial Acoustic Representation With Cross-Channel Signal Reconstruction and Multi-Channel Conformer
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-10 | DOI: 10.1109/TASLP.2024.3458811
Bing Yang;Xiaofei Li
Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.
Vol. 32, pp. 4211-4225.
Citations: 0
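The pretext task masks part of one channel's signal and asks the model to reconstruct it. The following sketch shows one plausible way to build such masked inputs and reconstruction targets from a multi-channel STFT; the patch size, mask ratio, and zero-filling are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def mask_one_channel(stft, ch=0, mask_ratio=0.5, patch=8, rng=None):
    """Zero out random time patches of one channel of a multi-channel STFT
    (channels, freq, time) and return the masked input, the keep-mask and
    the frames the model would have to reconstruct."""
    rng = np.random.default_rng() if rng is None else rng
    C, F, T = stft.shape
    keep = rng.random(T // patch) > mask_ratio        # keep/mask decision per patch
    mask = np.repeat(keep, patch)
    mask = np.pad(mask, (0, T - mask.size), constant_values=True)
    masked = stft.copy()
    masked[ch][:, ~mask] = 0.0                        # hide these frames from the model
    target = stft[ch][:, ~mask]                       # (F, n_masked) reconstruction target
    return masked, mask, target

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 257, 128)) + 1j * rng.standard_normal((2, 257, 128))
masked, mask, target = mask_one_channel(x, rng=rng)
print(masked.shape, int((~mask).sum()), target.shape)
```

Reconstructing the hidden frames forces the model to combine source information from the unmasked channel with spatial information from the masked one, which is the intuition behind the CCSR task.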
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-06 | DOI: 10.1109/TASLP.2024.3451951
Cheng Gong;Xin Wang;Erica Cooper;Dan Wells;Longbiao Wang;Jianwu Dang;Korin Richmond;Junichi Yamagishi
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voice, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. Our proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetically low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.
Vol. 32, pp. 4036-4051. Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10669054
Citations: 0
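The framework conditions synthesis on quantized latent speech representations from a pre-trained self-supervised model. As a minimal illustration of that quantization step (with a random codebook and arbitrary sizes, not the model's actual tokenizer), continuous frame features can be mapped to discrete tokens by nearest-neighbour lookup:

```python
import numpy as np

def quantize(features, codebook):
    """Nearest-neighbour vector quantization: map each frame feature to the
    index of the closest codebook entry and return tokens plus the quantized
    (codebook) vectors."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 64))      # e.g. 200 frames of 64-dim SSL features
codebook = rng.standard_normal((256, 64))   # 256-entry codebook (arbitrary size)
tokens, quantized = quantize(feats, codebook)
print(tokens[:10], quantized.shape)         # discrete tokens, (200, 64)
```

Discretizing the representation in this way removes much speaker- and channel-specific detail, which is one reason such tokens transfer well across languages and speakers.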
U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-06 | DOI: 10.1109/TASLP.2024.3453606
Tao Li;Zhichao Wang;Xinfa Zhu;Jian Cong;Qiao Tian;Yuping Wang;Lei Xie
Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.
Vol. 32, pp. 4026-4035.
Citations: 0
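The abstract mentions mean-based instance normalization and style-adaptive layer normalization inside the encoders. The sketch below gives a plain-numpy reading of those two operations; the shapes and the way the style-predicted gain and bias are produced are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def mean_instance_norm(x):
    """Mean-based instance normalization: remove the per-channel mean over
    time so utterance-level (speaker/style) statistics are stripped out.
    x has shape (channels, frames)."""
    return x - x.mean(axis=-1, keepdims=True)

def style_adaptive_layer_norm(h, gamma, beta, eps=1e-5):
    """Layer-normalize h (frames, dims), then re-scale and shift it with a
    style-conditioned gain/bias pair."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
h = rng.standard_normal((100, 80))           # 100 frames of 80-dim hidden features
gamma = 1.0 + 0.1 * rng.standard_normal(80)  # gain predicted from a style embedding
beta = 0.1 * rng.standard_normal(80)         # bias predicted from a style embedding
print(mean_instance_norm(h.T).shape, style_adaptive_layer_norm(h, gamma, beta).shape)
```

The first operation strips global statistics out of the content pathway, while the second re-injects speaker/style statistics during generation, which is the usual division of labour behind such normalization pairs.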
Blind Identification of Binaural Room Impulse Responses From Smart Glasses
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-05 | DOI: 10.1109/TASLP.2024.3454964
Thomas Deppisch;Nils Meyer-Kahlen;Sebastià V. Amengual Garí
Smart glasses are increasingly recognized as a key medium for augmented reality, offering a hands-free platform with integrated microphones and non-ear-occluding loudspeakers to seamlessly mix virtual sound sources into the real-world acoustic scene. To convincingly integrate virtual sound sources, the room acoustic rendering of the virtual sources must match the real-world acoustics. Information about a user's acoustic environment however is typically not available. This work uses a microphone array in a pair of smart glasses to blindly identify binaural room impulse responses (BRIRs) from a few seconds of speech in the real-world environment. The proposed method uses dereverberation and beamforming to generate a pseudo reference signal that is used by a multichannel Wiener filter to estimate room impulse responses which are then converted to BRIRs. The multichannel room impulse responses can be used to estimate room acoustic parameters which is shown to outperform baseline algorithms in the estimation of reverberation time and direct-to-reverberant energy ratio. Results from a listening experiment further indicate that the estimated BRIRs often reproduce the real-world room acoustics perceptually more convincingly than measured BRIRs from other rooms of similar size.
Vol. 32, pp. 4052-4065.
Citations: 0
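The method estimates room impulse responses with a multichannel Wiener filter driven by a pseudo reference signal. A single-channel frequency-domain Wiener deconvolution, shown below on a synthetic three-tap response, is a much-simplified stand-in that conveys the basic identification step; it is not the paper's multichannel estimator.

```python
import numpy as np

def wiener_deconvolve(mic, ref, reg=1e-3):
    """Estimate an impulse response h with mic ≈ ref * h via frequency-domain
    Wiener deconvolution (regularized spectral division)."""
    n = int(2 ** np.ceil(np.log2(len(mic) + len(ref))))
    R, M = np.fft.rfft(ref, n), np.fft.rfft(mic, n)
    H = np.conj(R) * M / (np.abs(R) ** 2 + reg)
    return np.fft.irfft(H, n)

rng = np.random.default_rng(1)
ref = rng.standard_normal(16000)                  # 1 s pseudo reference at 16 kHz
h_true = np.zeros(512)
h_true[[0, 200, 400]] = [1.0, 0.5, 0.25]          # toy 3-tap "room" response
mic = np.convolve(ref, h_true)                    # simulated microphone signal
h_est = wiener_deconvolve(mic, ref)[:512]
print(np.round(h_est[[0, 200, 400]], 2))          # ≈ [1.0, 0.5, 0.25]
```

In the paper the reference is itself produced by dereverberation and beamforming of the array signals, and the estimation is done jointly over all microphones before conversion to BRIRs.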
Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-03 | DOI: 10.1109/TASLP.2024.3451982
Sei Ueno;Akinobu Lee;Tatsuya Kawahara
While end-to-end automatic speech recognition (ASR) has shown impressive performance, it requires a huge amount of speech and transcription data. The conversion of domain-matched text to speech (TTS) has been investigated as one approach to data augmentation. The quality and diversity of the synthesized speech are critical in this approach. To ensure quality, a neural vocoder is widely used to generate speech waveforms in conventional studies, but it requires a huge amount of computation and another conversion to spectral-domain features such as the log-Mel filterbank (lmfb) output typically used for ASR. In this study, we explore the direct refinement of these features. Unlike conventional speech enhancement, we can use information on the ground-truth phone sequences of the speech and designated speaker to improve the quality and diversity. This process is realized as a Mel-to-Mel network, which can be placed after a text-to-Mel synthesis system such as FastSpeech 2. These two networks can be trained jointly. Moreover, semantic masking is applied to the lmfb features for robust training. Experimental evaluations demonstrate the effect of phone information, speaker information, and semantic masking. For speaker information, x-vector performs better than the simple speaker embedding. The proposed method achieves even better ASR performance with a much shorter computation time than the conventional method using a vocoder.
Vol. 32, pp. 3924-3933.
Citations: 0
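Semantic masking here means masking lmfb features along phone boundaries rather than at isolated random frames. The sketch below applies such phone-aligned masking to a log-Mel matrix; the segment boundaries and masking probability are hypothetical, whereas in the described system the boundaries would come from the ground-truth phone sequence mentioned in the abstract.

```python
import numpy as np

def semantic_mask(lmfb, phone_segments, mask_prob=0.3, rng=None):
    """Mask whole phone-aligned segments of a log-Mel feature matrix
    (frames, mels) instead of isolated random frames. phone_segments is a
    list of (start_frame, end_frame) pairs."""
    rng = np.random.default_rng() if rng is None else rng
    masked = lmfb.copy()
    mask = np.zeros(len(lmfb), dtype=bool)
    for start, end in phone_segments:
        if rng.random() < mask_prob:
            masked[start:end] = 0.0     # drop the whole phone segment
            mask[start:end] = True
    return masked, mask

rng = np.random.default_rng(3)
lmfb = rng.standard_normal((300, 80))                            # 300 frames x 80 Mel bins
segments = [(0, 40), (40, 90), (90, 160), (160, 230), (230, 300)]  # illustrative alignment
masked, mask = semantic_mask(lmfb, segments, rng=rng)
print(int(mask.sum()), "of", len(mask), "frames masked")
```

Masking at the phone level makes the refinement network fill in linguistically meaningful spans, which is intended to make the training more robust than frame-wise dropout.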
Sound Activity-Aware Based Cross-Task Collaborative Training for Semi-Supervised Sound Event Detection
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-08-29 | DOI: 10.1109/TASLP.2024.3451983
Yadong Guan;Jiqing Han;Hongwei Song;Shiwen Deng;Guibin Zheng;Tieran Zheng;Yongjun He
The training of sound event detection (SED) models remains a challenge of insufficient supervision due to limited frame-wise labeled data. Mainstream research on this problem has adopted semi-supervised training strategies that generate pseudo-labels for unlabeled data and use these data for the training of a model. Recent works further introduce multi-task training strategies to impose additional supervision. However, the auxiliary tasks employed in these methods either lack frame-wise guidance or exhibit unsuitable task designs. Furthermore, they fail to exploit inter-task relationships effectively, which can serve as valuable supervision. In this paper, we introduce a novel task, sound occurrence and overlap detection (SOD), which detects predefined sound activity patterns, including non-overlapping and overlapping cases. On the basis of SOD, we propose a cross-task collaborative training framework that leverages the relationship between SED and SOD to improve the SED model. Firstly, by jointly optimizing the two tasks in a multi-task manner, the SED model is encouraged to learn features sensitive to sound activity. Subsequently, the cross-task consistency regularization is proposed to promote consistent predictions between SED and SOD. Finally, we propose a pseudo-label selection method that uses inconsistent predictions between the two tasks to identify potential wrong pseudo-labels and mitigate their confirmation bias. In the inference phase, only the trained SED model is used, thus no additional computation and storage costs are incurred. Extensive experiments on the DESED dataset demonstrate the effectiveness of our method.
Vol. 32, pp. 3947-3959.
Citations: 0
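SOD detects predefined activity patterns covering non-overlapping and overlapping cases. One plausible way to derive frame-wise SOD targets from multi-label SED annotations is to count simultaneously active classes, as sketched below; the three-way labelling is an assumption, not necessarily the paper's exact pattern set.

```python
import numpy as np

def sod_labels(sed_frames):
    """Derive sound-occurrence/overlap labels from frame-wise multi-label SED
    annotations (frames, classes): 0 = silence, 1 = single active event,
    2 = overlapping events."""
    active = sed_frames.sum(axis=1)          # number of active classes per frame
    return np.clip(active, 0, 2).astype(int)

sed = np.zeros((6, 3), dtype=int)
sed[1, 0] = 1                   # one event active
sed[2, [0, 2]] = 1              # two events overlap
sed[3:5, 1] = 1                 # one event active
print(sod_labels(sed))          # -> [0 1 2 1 1 0]
```

Because SOD labels are a deterministic function of the SED labels, predictions from the two heads should agree, which is what the cross-task consistency regularization and pseudo-label selection in the paper exploit.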
Selective-Memory Meta-Learning With Environment Representations for Sound Event Localization and Detection
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-08-29 | DOI: 10.1109/TASLP.2024.3451974
Jinbo Hu;Yin Cao;Ming Wu;Qiuqiang Kong;Feiran Yang;Mark D. Plumbley;Jun Yang
Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, obtaining annotated samples for spatial sound events is notably costly. Deploying a SELD system in a new environment requires extensive time for re-training and fine-tuning. To overcome these challenges, we propose environment-adaptive Meta-SELD, designed for efficient adaptation to new environments using minimal data. Our method specifically utilizes computationally synthesized spatial data and employs Model-Agnostic Meta-Learning (MAML) on a pre-trained, environment-independent model. The method then utilizes fast adaptation to unseen real-world environments using limited samples from the respective environments. Inspired by the Learning-to-Forget approach, we introduce the concept of selective memory as a strategy for resolving conflicts across environments. This approach involves selectively memorizing target-environment-relevant information and adapting to the new environments through the selective attenuation of model parameters. In addition, we introduce environment representations to characterize different acoustic settings, enhancing the adaptability of our attenuation approach to various environments. We evaluate our proposed method on the development set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset and computationally synthesized scenes. Experimental results demonstrate the superior performance of the proposed method compared to conventional supervised learning methods, particularly in localization.
Vol. 32, pp. 4313-4327.
Citations: 0
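The method builds on Model-Agnostic Meta-Learning (MAML). To make the inner/outer-loop structure concrete, the toy sketch below runs a first-order MAML update on synthetic linear-regression "environments"; it illustrates the optimization pattern only and has nothing to do with the actual SELD network or the selective-memory mechanism.

```python
import numpy as np

def fomaml_update(w, tasks, inner_lr=0.05, meta_lr=0.1):
    """One first-order MAML step on linear-regression tasks. Each task is
    ((X_support, y_support), (X_query, y_query)); gradients are of the
    mean-squared error."""
    meta_grad = np.zeros_like(w)
    for (Xs, ys), (Xq, yq) in tasks:
        g_inner = 2 * Xs.T @ (Xs @ w - ys) / len(ys)     # support-set gradient
        w_adapted = w - inner_lr * g_inner               # inner adaptation step
        g_outer = 2 * Xq.T @ (Xq @ w_adapted - yq) / len(yq)
        meta_grad += g_outer                             # first-order: Hessian term dropped
    return w - meta_lr * meta_grad / len(tasks)

rng = np.random.default_rng(0)
w = np.zeros(4)
tasks = []
for _ in range(8):                                       # 8 synthetic "environments"
    w_true = rng.standard_normal(4)
    X = rng.standard_normal((40, 4))
    y = X @ w_true
    tasks.append(((X[:20], y[:20]), (X[20:], y[20:])))
w = fomaml_update(w, tasks)
print(w)
```

The meta-learned initialization is what allows fast adaptation to an unseen environment from a handful of samples; the paper's selective-memory idea additionally decides which parameters to attenuate during that adaptation.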
Binaural Beamforming Taking Into Account Spatial Release From Masking
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-08-29 | DOI: 10.1109/TASLP.2024.3451988
Johannes W. de Vries;Steven van de Par;Geert Leus;Richard Heusdens;Richard C. Hendriks
Hearing impairment is a prevalent problem with daily challenges like impaired speech intelligibility and sound localisation. One of the shortcomings of spatial filtering in hearing aids is that speech intelligibility is often not optimised directly, meaning that different auditory processes contributing to intelligibility are often not considered. One example is the perceptual phenomenon known as spatial release from masking (SRM). This paper develops a signal model that explicitly considers SRM in the beamforming design, achieved by transforming the binaural intelligibility prediction model (BSIM) into a signal processing framework. The resulting extended signal model is used to analyse the performance of reference beamformers and design a novel beamformer that more closely considers how the auditory system perceives binaural sound. It can be shown that the binaural minimum variance distortionless response (BMVDR) beamformer is also an optimal solution for the extended, perceived model, suggesting that SRM does not play a significant role in intelligibility enhancement after optimal beamforming. However, the optimal beamformer is no longer unique in the extended signal model. The additional secondary degrees of freedom can be used to preserve binaural cues of interfering sources while still achieving the same perceived performance of the BMVDR beamformer, though with a possible high sensitivity to intelligibility model mismatch errors.
Vol. 32, pp. 4002-4012.
Citations: 0
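The analysis centres on the binaural MVDR (BMVDR) beamformer. Its core computation is the classic MVDR solution w = R^{-1} d / (d^H R^{-1} d); the sketch below evaluates it for a toy two-microphone noise covariance and checks the distortionless constraint. The binaural extension and the SRM-aware signal model developed in the paper are not reproduced here.

```python
import numpy as np

def mvdr_weights(R, d, diag_load=1e-6):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d): minimize output noise power
    subject to a distortionless response in the target direction d."""
    Rl = R + diag_load * np.trace(R).real / len(d) * np.eye(len(d))
    Rinv_d = np.linalg.solve(Rl, d)
    return Rinv_d / np.vdot(d, Rinv_d)

# Toy 2-microphone example with a correlated noise covariance estimate.
d = np.array([1.0, np.exp(-1j * 0.4)])                   # target relative transfer function
R = np.array([[1.0, 0.3], [0.3, 1.0]], dtype=complex)    # noise covariance estimate
w = mvdr_weights(R, d)
print(np.vdot(w, d))                                     # distortionless: ≈ 1 + 0j
```

The paper's finding is that this solution remains optimal under the extended, SRM-aware intelligibility model, while the remaining degrees of freedom can be spent on preserving binaural cues of the interferers.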
RefXVC: Cross-Lingual Voice Conversion With Enhanced Reference Leveraging
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-08-28 | DOI: 10.1109/TASLP.2024.3439996
Mingyang Zhang;Yi Zhou;Yi Ren;Chen Zhang;Xiang Yin;Haizhou Li
This paper proposes RefXVC, a method for cross-lingual voice conversion (XVC) that leverages reference information to improve conversion performance. Previous XVC works generally take an average speaker embedding to condition the speaker identity, which does not account for the changing timbre of speech that occurs with different pronunciations. To address this, our method uses both global and local speaker embeddings to capture the timbre changes during speech conversion. Additionally, we observed a connection between timbre and pronunciation in different languages and utilized this by incorporating a timbre encoder and a pronunciation matching network into our model. Furthermore, we found that the variation in tones is not adequately reflected in a sentence, and therefore, we used multiple references to better capture the range of a speaker's voice. The proposed method outperformed existing systems in terms of both speech quality and speaker similarity, highlighting the effectiveness of leveraging reference information in cross-lingual voice conversion.
Vol. 32, pp. 4146-4156.
Citations: 0
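The model combines global and local speaker embeddings drawn from multiple references. The sketch below is one speculative way to blend an utterance-level (global) reference embedding with frame-level (local) embeddings selected by attention; all shapes, the mixing weight, and the attention form are assumptions for illustration, not the RefXVC architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mix_speaker_embeddings(content, ref_frames, alpha=0.5):
    """Blend a global reference embedding (mean over all reference frames)
    with locally attended reference frames, one per content frame.
    Shapes: content (T, D), ref_frames (N, D)."""
    global_emb = ref_frames.mean(axis=0)                                  # (D,)
    attn = softmax(content @ ref_frames.T / np.sqrt(content.shape[1]), axis=1)
    local_emb = attn @ ref_frames                                         # (T, D)
    return alpha * global_emb[None, :] + (1 - alpha) * local_emb

rng = np.random.default_rng(5)
content = rng.standard_normal((120, 64))   # content features of the utterance to convert
refs = rng.standard_normal((300, 64))      # frames pooled from multiple reference utterances
print(mix_speaker_embeddings(content, refs).shape)   # (120, 64)
```

Pooling frames from several references, as in the last two lines, is one way to cover more of a speaker's timbre range than a single short reference can.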