{"title":"Reconstructing continuous language from brain signals measured by fMRI based brain-computer interface","authors":"Shurui Li, Yuanning Li, Ru-Yuan Zhang","doi":"10.1002/brx2.70001","DOIUrl":null,"url":null,"abstract":"<p>Brain-computer interfaces (BCIs) are designed to bridge the gap between human neural activity and external devices. Previous studies have shown that speech and text can be decoded from signals recorded from intracranial electrodes.<span><sup>1</sup></span> Such applications can be used to develop neuroprostheses to restore speech function in patients with brain and psychiatric disorders.<span><sup>2</sup></span> These methods largely rely on invasive intracranial neural recordings that provide signals with high spatiotemporal resolution and high signal-to-noise ratio. Despite the advantage of being non-invasive, low temporal resolution means functional magnetic resonance imaging (fMRI) has rarely been used in this context to decode continuous speech, with its application primarily limited to coarse classification tasks.<span><sup>3</sup></span></p><p>Despite this, fMRI-based neural encoding models have seen great progress in the last decade. For example, voxel-wise neural responses to continuous natural speech can be predicted using feature embeddings extracted from language models.<span><sup>4</sup></span> To reconstruct continuous speech from fMRI, three obstacles must be overcome. First, the brain's semantic representation regions are not clearly defined—previous research suggests a distributed network across various brain areas. Second, due to its temporal sluggishness, a single fMRI time point captures information from multiple preceding words within a 6–10-s window. 
Third, constraining the semantic space in language construction is challenging, as existing fMRI data capture only a fraction of the real semantic richness.</p><p>In a recently published study,<span><sup>5</sup></span> Tang and colleagues propose a Bayesian method to decode continuous language from brain responses measured by fMRI. Unlike previous attempts to decode semantic vectors (<i>S</i>) directly from brain responses (<i>R</i>), this study used brain responses as a control condition for language generation models. The goal was to invert the encoding model to identify the most appropriate stimulus. According to Bayesian theory, the decoder estimates the posterior distribution <i>P</i>(<i>S</i>|<i>R</i>) and finds the stimuli <i>S</i> that maximizes the posterior distribution given the neural response <i>R</i>. Instead of directly building decoders that estimate <i>P</i>(<i>S</i>|<i>R</i>), which is usually intractable due to the aforementioned difficulties, the authors took advantage of the Bayesian decoding framework that <i>P</i>(<i>S</i>|<i>R</i>) ∝ <i>P</i>(<i>S</i>)<i>P</i>(<i>R</i>|<i>S</i>) and focused instead on the encoding model <i>P</i>(<i>R</i>|<i>S</i>).</p><p>This work successfully overcame the three main barriers to fMRI-based language decoding. First, to localize the brain voxels containing semantic information, encoding performance was used as a metric to select voxels for decoding. Second, to deal with the temporal sluggishness of blood oxygen level-dependent (BOLD) signals, the semantic information for 10 s preceding each repetition time was used to build the encoding model. Third, to ensure that meaningful and readable sentences could be reconstructed, the language model GPT-1 was used to parameterize the prior distribution <i>P</i>(<i>S</i>) over the entire semantic space. GPT-1 uses an autoregressive model to predict words based on prior context, enabling natural language generation. 
Additionally, a beam search algorithm was used to maintain a relatively large and stable candidate pool.</p><p>We note several differences between non-invasive fMRI-based and invasive electrophysiology-based language decoding. The success of language decoding in this study is mainly due to the distributed nature of semantic representations in the brain, and the fact that semantic representations during speech perception can be reliably captured by BOLD signals. However, semantic space is highly multi-dimensional, continuous, and infinite. Invasive speech BCIs rely on electrophysiological signals with high temporal resolution from the sensorimotor cortex; finite, discrete sets of decoding targets, such as phonemes or letters, result in relatively low word error rates. Nevertheless, the semantic reconstruction approach proposed in this study is promising for decoding higher-level amodal concepts, for example, the decoding of text from silent videos, which cannot be easily achieved by invasive speech-motor BCIs.</p><p>Despite the many advantages mentioned above, this work still has some limitations. First, in the Bayesian decoding framework, the effectiveness of the decoder depends heavily on the performance of the encoding model. GPT-1 embeddings may represent only a subset of the semantic information in the brain. For example, in this work, only well-encoded voxels were used for decoding. The remaining voxels are probably also involved in semantic representation, but cannot be encoded by GPT-1 embeddings. Second, this work assumed that the total brain response is the sum of responses to semantics in previous time points. This assumption may not be consistent with the actual activation process in the brain.</p><p>Despite its limitations, this study sheds new light on non-invasive BCI techniques. We see several promising directions for BCIs in the future. 
First, safer, portable, and durable invasive BCIs could help thousands of patients with neurological disorders to express their thoughts. Second, cheaper, smaller non-invasive BCIs may have clinical and entertainment applications, such as in the metaverse. Finally, it is also crucial to improve the temporal resolution of non-invasive BCIs. For example, combination with electroencephalogram or magnetoencephalography data could compensate for the low temporal resolution of fMRI. With higher temporal resolution, the decoder could use both semantic and sensorimotor information to improve reconstruction accuracy.</p><p><b>Shurui Li</b>: Conceptualization; formal analysis; visualization; writing—original draft. <b>Yuanning Li</b>: Conceptualization; funding acquisition; investigation; resources; supervision; validation; visualization; writing—review and editing. <b>Ru-Yuan Zhang</b>: Conceptualization; formal analysis; funding acquisition; project administration; resources; supervision; validation; visualization; writing—original draft; writing—review and editing.</p><p>The authors declare no competing interests.</p><p>This is a commentary paper with no empirical experiment.</p>","PeriodicalId":94303,"journal":{"name":"Brain-X","volume":"2 3","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/brx2.70001","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Brain-X","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/brx2.70001","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Brain-computer interfaces (BCIs) are designed to bridge the gap between human neural activity and external devices. Previous studies have shown that speech and text can be decoded from signals recorded by intracranial electrodes.1 Such applications can be used to develop neuroprostheses that restore speech function in patients with neurological and psychiatric disorders.2 These methods rely largely on invasive intracranial recordings, which provide signals with high spatiotemporal resolution and a high signal-to-noise ratio. Although functional magnetic resonance imaging (fMRI) is non-invasive, its low temporal resolution means it has rarely been used to decode continuous speech; its applications have been limited primarily to coarse classification tasks.3
Nevertheless, fMRI-based neural encoding models have made great progress over the past decade. For example, voxel-wise neural responses to continuous natural speech can be predicted from feature embeddings extracted from language models.4 To reconstruct continuous speech from fMRI, three obstacles must be overcome. First, the brain regions that represent semantics are not clearly delineated; previous research suggests a network distributed across multiple brain areas. Second, because of its temporal sluggishness, a single fMRI time point captures information from multiple preceding words within a 6–10-s window. Third, constraining the semantic space during language generation is challenging, because existing fMRI data capture only a fraction of true semantic richness.
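The second obstacle, hemodynamic lag, is commonly handled by giving the encoding model lagged copies of the stimulus features, so that each TR sees the words from the preceding several seconds. Below is a minimal NumPy sketch of this idea, assuming a ridge-regression encoder; the function names `make_lagged_features` and `fit_ridge_encoder` are illustrative, not from the paper.

```python
import numpy as np

def make_lagged_features(embeddings, n_lags):
    """Concatenate the embeddings of the preceding n_lags time points at
    each TR, approximating the 6-10 s window integrated by the BOLD signal."""
    T, D = embeddings.shape
    lagged = np.zeros((T, n_lags * D))
    for lag in range(n_lags):
        lagged[lag:, lag * D:(lag + 1) * D] = embeddings[:T - lag]
    return lagged

def fit_ridge_encoder(features, bold, alpha=1.0):
    """Closed-form ridge regression mapping lagged stimulus features to
    voxel responses; returns an (n_features, n_voxels) weight matrix."""
    F = features
    return np.linalg.solve(F.T @ F + alpha * np.eye(F.shape[1]), F.T @ bold)
```

In practice the lag structure and regularization strength are tuned per voxel on held-out data, but the core idea is simply a linear map from time-lagged semantic embeddings to BOLD responses.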
In a recently published study,5 Tang and colleagues proposed a Bayesian method to decode continuous language from brain responses measured by fMRI. Unlike previous attempts to decode semantic vectors (S) directly from brain responses (R), this study used the brain responses to condition a language generation model: the goal was to invert the encoding model to identify the most likely stimulus. In Bayesian terms, the decoder estimates the posterior distribution P(S|R) and finds the stimulus S that maximizes the posterior given the neural response R. Instead of directly building a decoder that estimates P(S|R), which is usually intractable because of the difficulties mentioned above, the authors exploited Bayes' rule, P(S|R) ∝ P(S)P(R|S), and focused instead on the encoding model P(R|S).
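Conceptually, such a Bayesian decoder scores each candidate word sequence by combining a language-model prior log P(S) with an encoding-model likelihood log P(R|S). A toy sketch of that scoring rule, assuming a Gaussian likelihood around the encoding model's predicted response (the function names and noise model are illustrative simplifications, not the authors' exact pipeline):

```python
import numpy as np

def log_likelihood(r_observed, r_predicted, sigma=1.0):
    """Gaussian log P(R|S) up to a constant: how well the encoding model's
    predicted response matches the observed BOLD pattern."""
    return -0.5 * np.sum((r_observed - r_predicted) ** 2) / sigma ** 2

def log_posterior(r_observed, lm_logprob, r_predicted, sigma=1.0):
    """log P(S|R) up to a constant: language-model prior log P(S) plus
    encoding-model likelihood log P(R|S), per Bayes' rule."""
    return lm_logprob + log_likelihood(r_observed, r_predicted, sigma)
```

The decoder keeps the candidate sequences with the highest posterior score: the prior keeps the output fluent, while the likelihood ties it to the measured brain response.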
This work successfully overcame the three main barriers to fMRI-based language decoding. First, to localize the brain voxels carrying semantic information, encoding performance was used as the metric for selecting voxels for decoding. Second, to handle the temporal sluggishness of the blood oxygen level-dependent (BOLD) signal, the semantic information from the 10 s preceding each repetition time (TR) was used to build the encoding model. Third, to ensure that meaningful and readable sentences could be reconstructed, the language model GPT-1 was used to parameterize the prior distribution P(S) over the entire semantic space. GPT-1 is an autoregressive model that predicts the next word from the preceding context, enabling natural language generation. Additionally, a beam search algorithm was used to maintain a relatively large and stable pool of candidate sequences.
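The beam search step can be sketched generically as follows; here `expand_fn` (proposing word continuations, e.g. sampled from the language model) and `score_fn` (the posterior score combining prior and encoding likelihood) are hypothetical stand-ins for the paper's components.

```python
def beam_search_step(beams, expand_fn, score_fn, beam_width):
    """One beam search step: extend every kept hypothesis with its possible
    continuations, score each extension, and retain the top beam_width."""
    candidates = []
    for seq, _ in beams:
        for word in expand_fn(seq):
            new_seq = seq + [word]
            candidates.append((new_seq, score_fn(new_seq)))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]
```

Keeping a wide beam matters here because the BOLD likelihood is noisy: a hypothesis that scores poorly at one TR may still lead to the best overall reconstruction a few words later.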
We note several differences between non-invasive fMRI-based and invasive electrophysiology-based language decoding. The success of language decoding in this study stems mainly from the distributed nature of semantic representations in the brain and from the fact that semantic representations during speech perception can be reliably captured by BOLD signals. The semantic space, however, is high-dimensional, continuous, and effectively unbounded. Invasive speech BCIs instead rely on high-temporal-resolution electrophysiological signals from the sensorimotor cortex; their finite, discrete sets of decoding targets, such as phonemes or letters, yield relatively low word error rates. Nevertheless, the semantic reconstruction approach proposed in this study holds promise for decoding higher-level amodal concepts, for example decoding text from silent videos, which cannot easily be achieved by invasive speech-motor BCIs.
Despite the many advantages mentioned above, this work still has some limitations. First, within the Bayesian decoding framework, the effectiveness of the decoder depends heavily on the performance of the encoding model, and GPT-1 embeddings may capture only a subset of the semantic information represented in the brain. Indeed, only well-encoded voxels were used for decoding in this work; the remaining voxels are probably also involved in semantic representation but cannot be predicted from GPT-1 embeddings. Second, this work assumed that the total brain response at each time point is a linear sum of the responses to the semantic content of the preceding time points. This linearity assumption may not reflect the brain's actual activation dynamics.
Despite its limitations, this study sheds new light on non-invasive BCI techniques, and we see several promising directions for BCIs. First, safer, portable, and durable invasive BCIs could help thousands of patients with neurological disorders express their thoughts. Second, cheaper, smaller non-invasive BCIs may find clinical and entertainment applications, such as in the metaverse. Finally, it is also crucial to improve the effective temporal resolution of non-invasive BCIs; for example, combining fMRI with electroencephalography (EEG) or magnetoencephalography (MEG) could compensate for fMRI's low temporal resolution. With higher temporal resolution, the decoder could exploit both semantic and sensorimotor information to improve reconstruction accuracy.