Zexu Pan;Marvin Borsdorf;Siqi Cai;Tanja Schultz;Haizhou Li
{"title":"NeuroHeed:使用脑电信号的神经分层扬声器提取技术","authors":"Zexu Pan;Marvin Borsdorf;Siqi Cai;Tanja Schultz;Haizhou Li","doi":"10.1109/TASLP.2024.3463498","DOIUrl":null,"url":null,"abstract":"Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as \n<italic>selective auditory attention</i>\n. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the corresponding brain's elicited neuronal activities. In this work, we study such brain activities measured using affordable and non-intrusive electroencephalography (EEG) devices. We present NeuroHeed, a speaker extraction model that leverages the listener's synchronized EEG signals to extract the attended speech signal in a cocktail party scenario, in which the extraction process is conditioned on a neuronal attractor encoded from the EEG signal. We propose both an offline and an online NeuroHeed, with the latter designed for real-time inference. In the online NeuroHeed, we additionally propose an autoregressive speaker encoder, which accumulates past extracted speech signals for self-enrollment of the attended speaker information into an auditory attractor, that retains the attentional momentum over time. Online NeuroHeed extracts the current window of the speech signals with guidance from both attractors. 
Experimental results on KUL dataset two-speaker scenario demonstrate that NeuroHeed effectively extracts brain-attended speech signals with an average scale-invariant signal-to-noise ratio improvement (SI-SDRi) of 14.3 dB and extraction accuracy of 90.8% in offline settings, and SI-SDRi of 11.2 dB and extraction accuracy of 85.1% in online settings.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4456-4470"},"PeriodicalIF":4.1000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10683957","citationCount":"0","resultStr":"{\"title\":\"NeuroHeed: Neuro-Steered Speaker Extraction Using EEG Signals\",\"authors\":\"Zexu Pan;Marvin Borsdorf;Siqi Cai;Tanja Schultz;Haizhou Li\",\"doi\":\"10.1109/TASLP.2024.3463498\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as \\n<italic>selective auditory attention</i>\\n. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the corresponding brain's elicited neuronal activities. In this work, we study such brain activities measured using affordable and non-intrusive electroencephalography (EEG) devices. We present NeuroHeed, a speaker extraction model that leverages the listener's synchronized EEG signals to extract the attended speech signal in a cocktail party scenario, in which the extraction process is conditioned on a neuronal attractor encoded from the EEG signal. We propose both an offline and an online NeuroHeed, with the latter designed for real-time inference. 
In the online NeuroHeed, we additionally propose an autoregressive speaker encoder, which accumulates past extracted speech signals for self-enrollment of the attended speaker information into an auditory attractor, that retains the attentional momentum over time. Online NeuroHeed extracts the current window of the speech signals with guidance from both attractors. Experimental results on KUL dataset two-speaker scenario demonstrate that NeuroHeed effectively extracts brain-attended speech signals with an average scale-invariant signal-to-noise ratio improvement (SI-SDRi) of 14.3 dB and extraction accuracy of 90.8% in offline settings, and SI-SDRi of 11.2 dB and extraction accuracy of 85.1% in online settings.\",\"PeriodicalId\":13332,\"journal\":{\"name\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"volume\":\"32 \",\"pages\":\"4456-4470\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10683957\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10683957/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language 
Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10683957/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
NeuroHeed: Neuro-Steered Speaker Extraction Using EEG Signals
Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, a phenomenon known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the neuronal activity it elicits in the listener's brain. In this work, we study such brain activity as measured with affordable, non-intrusive electroencephalography (EEG) devices. We present NeuroHeed, a speaker extraction model that leverages the listener's synchronized EEG signals to extract the attended speech signal in a cocktail-party scenario; the extraction process is conditioned on a neuronal attractor encoded from the EEG signal. We propose both an offline and an online NeuroHeed, the latter designed for real-time inference. For the online NeuroHeed, we additionally propose an autoregressive speaker encoder that accumulates past extracted speech signals for self-enrollment of the attended speaker's information into an auditory attractor, retaining the attentional momentum over time. Online NeuroHeed then extracts the current window of the speech signal with guidance from both attractors. Experimental results on the two-speaker scenario of the KUL dataset demonstrate that NeuroHeed effectively extracts the brain-attended speech signal, achieving an average scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 14.3 dB with 90.8% extraction accuracy in the offline setting, and an SI-SDRi of 11.2 dB with 85.1% extraction accuracy in the online setting.
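The reported SI-SDRi figures can be made concrete with a short sketch of the metric. This is the standard scale-invariant SDR definition, not code from the paper, and the helper names are hypothetical: the estimate is projected onto the reference, and the improvement is the gain of the extracted signal over the unprocessed mixture.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (dB) between an
    estimated signal and the clean reference."""
    # Zero-mean both signals so the metric ignores DC offset.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the projection is the
    # "target" component, the residual is the distortion.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def si_sdr_improvement(estimate: np.ndarray,
                       reference: np.ndarray,
                       mixture: np.ndarray) -> float:
    """SI-SDRi: how much the extraction improved over just listening
    to the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the target component is rescaled by the projection coefficient, multiplying the estimate by any nonzero constant leaves the score unchanged, which is what makes the metric "scale-invariant".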
Journal overview:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.