This paper introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis of various text corpora, differing in size and domain and tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with the standard dotted Arabic text using text entropy analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English text analysis to gain additional insights. Our investigation extended to various upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both representations. Specifically, we performed seven different downstream tasks using various tokenization schemes, comparing the standard dotted text with the dotless Arabic text representation. Performance was comparable for both representations across the different tokenizations. However, the dotless representation achieves these results with a significant reduction in vocabulary size, up to 50% in some scenarios. Additionally, we present a system that restores dots to dotless Arabic text, which is useful for tasks that require Arabic text as output.
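To make the representation concrete, the following is a minimal sketch of a dotted-to-dotless conversion: dotted letters are collapsed onto their dotless skeleton (rasm) forms, so several distinct dotted words can map to the same dotless string. The character table is illustrative and assumed for demonstration only; it is not the paper's exact mapping.

```python
# Illustrative (partial) mapping from dotted Arabic letters to dotless skeleton forms.
# This table is an assumption for demonstration, not the paper's mapping.
DOTLESS_MAP = {
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ٮ", "ي": "ى",   # beh-like group
    "ج": "ح", "خ": "ح",                                   # hah group
    "ذ": "د", "ز": "ر",                                   # dal / reh groups
    "ش": "س", "ض": "ص", "ظ": "ط", "غ": "ع",              # remaining dotted letters
    "ف": "ڡ", "ق": "ٯ", "ة": "ه",                         # feh, qaf, teh marbuta
}

def to_dotless(text: str) -> str:
    """Replace dotted letters with their skeleton forms, leaving other characters intact."""
    return "".join(DOTLESS_MAP.get(ch, ch) for ch in text)

word = "يكتب"  # "he writes"
print(word, "->", to_dotless(word))
```

Because several dotted letters share one skeleton, distinct dotted word forms can collapse onto a single dotless form, which is consistent with the vocabulary reductions reported above.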
{"title":"Dotless Arabic text for Natural Language Processing","authors":"Maged S. Al-Shaibani, Irfan Ahmad","doi":"10.1162/coli_a_00535","DOIUrl":"https://doi.org/10.1162/coli_a_00535","url":null,"abstract":"This paper introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis on various text corpora, differing in size and domain, and tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with the standard dotted Arabic text using text entropy analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English text analysis to gain additional insights. Our investigation extended to various upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both the representations. Specifically, we performed seven different downstream tasks using various tokenization schemes comparing the standard dotted text with dotless Arabic text representations. The performances using both the representations were comparable across different tokenizations. However, dotless representation achieves these results with significant reduction in vocabulary sizes, and in some scenarios showing reduction of up to 50%. Additionally, we present a system that restores dots to the dotless Arabic text. This system is useful for tasks that require Arabic texts as output.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"311 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Humans acquire their native languages by taking part in communicative interactions with their caregivers. These interactions are meaningful, intentional, and situated in their everyday environment. The situated and communicative nature of the interactions is essential to the language acquisition process, as language learners depend on clues provided by the communicative environment to make sense of the utterances they perceive. As such, the linguistic knowledge they build up is rooted in linguistic forms, their meaning, and their communicative function. When it comes to machines, the situated, communicative, and interactional aspects of language learning are often passed over. This applies in particular to today’s large language models (LLMs), where the input is predominantly text-based, and where the distribution of character groups or words serves as a basis for modeling the meaning of linguistic expressions. In this article, we argue that this design choice lies at the root of a number of important limitations, in particular regarding the data hungriness of the models, their limited ability to perform human-like logical and pragmatic reasoning, and their susceptibility to biases. At the same time, we make a case for an alternative approach that models how artificial agents can acquire linguistic structures by participating in situated communicative interactions. Through a selection of experiments, we show how the linguistic knowledge that is captured in the resulting models is of a fundamentally different nature than the knowledge captured by LLMs and argue that this change of perspective provides a promising path towards more human-like language processing in machines.
{"title":"Humans Learn Language from Situated Communicative Interactions. What about Machines?","authors":"Katrien Beuls, Paul Van Eecke","doi":"10.1162/coli_a_00534","DOIUrl":"https://doi.org/10.1162/coli_a_00534","url":null,"abstract":"Humans acquire their native languages by taking part in communicative interactions with their caregivers. These interactions are meaningful, intentional, and situated in their everyday environment. The situated and communicative nature of the interactions is essential to the language acquisition process, as language learners depend on clues provided by the communicative environment to make sense of the utterances they perceive. As such, the linguistic knowledge they build up is rooted in linguistic forms, their meaning, and their communicative function. When it comes to machines, the situated, communicative, and interactional aspects of language learning are often passed over. This applies in particular to today’s large language models (LLMs), where the input is predominantly text-based, and where the distribution of character groups or words serves as a basis for modeling the meaning of linguistic expressions. In this article, we argue that this design choice lies at the root of a number of important limitations, in particular regarding the data hungriness of the models, their limited ability to perform human-like logical and pragmatic reasoning, and their susceptibility to biases. At the same time, we make a case for an alternative approach that models how artificial agents can acquire linguistic structures by participating in situated communicative interactions. Through a selection of experiments, we show how the linguistic knowledge that is captured in the resulting models is of a fundamentally different nature than the knowledge captured by LLMs and argue that this change of perspective provides a promising path towards more human-like language processing in machines.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"27 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141881873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large language models (LLMs) have garnered a great deal of attention for their exceptional generative performance on commonsense and reasoning tasks. In this work, we investigate LLMs’ capabilities for generalization using a particularly challenging type of statement: generics. Generics express generalizations (e.g., birds can fly) but do so without explicit quantification. They are notable because they generalize over their instantiations (e.g., sparrows can fly) yet hold true even in the presence of exceptions (e.g., penguins do not). For humans, these generic generalizations play a fundamental role in cognition, concept acquisition, and intuitive reasoning. We investigate how LLMs respond to and reason about generics. To this end, we first propose a framework grounded in pragmatics to automatically generate both exceptions and instantiations – collectively, exemplars. We make use of focus – a pragmatic phenomenon that highlights meaning-bearing elements in a sentence – to capture the full range of interpretations of generics across different contexts of use. This allows us to derive precise logical definitions for exemplars and operationalize them to automatically generate exemplars from LLMs. Using our system, we generate a dataset of ∼370k exemplars across ∼17k generics and conduct a human validation of a sample of the generated data. We use our final generated dataset to investigate how LLMs reason about generics. Humans have a documented tendency to conflate universally quantified statements (e.g., all birds can fly) with generics. Therefore, we probe whether LLMs exhibit similar overgeneralization behavior in terms of quantification and in property inheritance. We find that LLMs do show evidence of overgeneralization, although they sometimes struggle to reason about exceptions. Furthermore, we find that LLMs may exhibit similar non-logical behavior to humans when considering property inheritance from generics.
{"title":"Exceptions, Instantiations, and Overgeneralization: Insights into How Language Models Process Generics","authors":"Emily Allaway, Chandra Bhagavatula, Jena D. Hwang, Kathleen McKeown, Sarah-Jane Leslie","doi":"10.1162/coli_a_00530","DOIUrl":"https://doi.org/10.1162/coli_a_00530","url":null,"abstract":"Large language models (LLMs) have garnered a great deal of attention for their exceptional generative performance on commonsense and reasoning tasks. In this work, we investigate LLMs’ capabilities for generalization using a particularly challenging type of statement: generics. Generics express generalizations (e.g., birds can fly) but do so without explicit quantification. They are notable because they generalize over their instantiations (e.g., sparrows can fly) yet hold true even in the presence of exceptions (e.g., penguins do not). For humans, these generic generalization play a fundamental role in cognition, concept acquisition, and intuitive reasoning. We investigate how LLMs respond to and reason about generics. To this end, we first propose a framework grounded in pragmatics to automatically generate both exceptions and instantiations – collectively exemplars. We make use of focus – a pragmatic phenomenon that highlights meaning-bearing elements in a sentence – to capture the full range of interpretations of generics across different contexts of use. This allows us to derive precise logical definitions for exemplars and operationalize them to automatically generate exemplars from LLMs. Using our system, we generate a dataset of ∼370k exemplars across ∼17k generics and conduct a human validation of a sample of the generated data. We use our final generated dataset to investigate how LLMs’ reason about generics. Humans have a documented tendency to conflate universally quantified statements (e.g., all birds can fly) with generics. Therefore, we probe whether LLMs exhibit similar overgeneralization behavior in terms of quantification and in property inheritance. We find that LLMs do show evidence of overgeneralization, although they sometimes struggle to reason about exceptions. Furthermore, we find that LLMs may exhibit similar non-logical behavior to humans when considering property inheritance from generics.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"7 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study explores the cognitive mechanisms underlying human language acquisition through grammar induction by a minimal cognitive architecture, with a short and flexible sequence memory as its most central feature. We use reinforcement learning for the task of identifying sentences in a stream of words from artificial languages. Results demonstrate the model’s ability to identify frequent and informative multi-word chunks, reproducing characteristics of natural language acquisition. The model successfully navigates varying degrees of linguistic complexity, exposing efficient adaptation to combinatorial challenges through the reuse of sequential patterns. The emergence of parsimonious tree structures suggests an optimization for the sentence identification task, balancing economy and information. The cognitive architecture reflects aspects of human memory systems and decision-making processes, enhancing its cognitive plausibility. While the model exhibits limitations in generalization and semantic representation, its minimalist nature offers insights into some fundamental mechanisms of language learning. Our study demonstrates the power of this simple architecture and stresses the importance of sequence memory in language learning. Since other animals do not seem to have faithful sequence memory, this may be a key to understanding why only humans have developed complex languages.
{"title":"Usage-based grammar induction from minimal cognitive principles","authors":"Anna Jon-And, Jérôme Michaud","doi":"10.1162/coli_a_00528","DOIUrl":"https://doi.org/10.1162/coli_a_00528","url":null,"abstract":"This study explores the cognitive mechanisms underlying human language acquisition through grammar induction by a minimal cognitive architecture, with a short and flexible sequence memory as its most central feature. We use reinforcement learning for the task of identifying sentences in a stream of words from artificial languages. Results demonstrate the model’s ability to identify frequent and informative multi-word chunks, reproducing characteristics of natural language acquisition. The model successfully navigates varying degrees of linguistic complexity, exposing efficient adaptation to combinatorial challenges through the reuse of sequential patterns. The emergence of parsimonious tree structures suggests an optimization for the sentence identification task, balancing economy and information. The cognitive architecture reflects aspects of human memory systems and decision-making processes, enhancing its cognitive plausibility. While the model exhibits limitations in generalization and semantic representation, its minimalist nature offers insights into some fundamental mechanisms of language learning. Our study demonstrates the power of this simple architecture and stresses the importance of sequence memory in language learning. Since other animals do not seem to have faithful sequence memory, this may be a key to understanding why only humans have developed complex languages.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"74 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what “understanding” means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes — inspired by Fregean senses — of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model’s multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.
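As a rough illustration of the consistency tests described above, the sketch below poses the same factual question in several languages and measures how often the answers agree. The `query_model` callable is a hypothetical stand-in for the model under study (e.g., GPT-3.5); no actual API call, and none of the paper's prompts, are reproduced here.

```python
from typing import Callable, Dict

def multisense_consistency(
    query_model: Callable[[str], str],
    prompts_by_language: Dict[str, str],
    normalize: Callable[[str], str] = lambda s: s.strip().lower(),
) -> float:
    """Return the fraction of language pairs whose (normalized) answers agree."""
    answers = {lang: normalize(query_model(p)) for lang, p in prompts_by_language.items()}
    langs = list(answers)
    pairs = [(a, b) for i, a in enumerate(langs) for b in langs[i + 1:]]
    if not pairs:
        return 1.0
    agree = sum(answers[a] == answers[b] for a, b in pairs)
    return agree / len(pairs)

# Example usage with a trivial mock model that always answers "1945".
prompts = {
    "en": "In which year did World War II end? Answer with the year only.",
    "de": "In welchem Jahr endete der Zweite Weltkrieg? Antworte nur mit der Jahreszahl.",
    "it": "In che anno è finita la seconda guerra mondiale? Rispondi solo con l'anno.",
}
print(multisense_consistency(lambda p: "1945", prompts))  # -> 1.0
```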
{"title":"From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency","authors":"Xenia Ohmer, Elia Bruni, Dieuwke Hupkes","doi":"10.1162/coli_a_00529","DOIUrl":"https://doi.org/10.1162/coli_a_00529","url":null,"abstract":"The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what “understanding” means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes — inspired by Fregean senses — of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model’s multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"73 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The brain’s ability to perform complex computations at varying timescales is crucial, ranging from understanding single words to grasping the overarching narrative of a story. Recently, multi-timescale long short-term memory (MT-LSTM) models (Mahto et al. 2020; Jain et al. 2020) have been introduced, which use temporally-tuned parameters to induce sensitivity to different timescales of language processing (i.e. related to near/distant words). However, there has not been an exploration of the relation between such temporally-tuned information processing in MT-LSTMs and the brain’s language processing using high temporal resolution recording modalities, such as electroencephalography (EEG). To bridge this gap, we used an EEG dataset recorded while participants listened to Chapter 1 of “Alice in Wonderland” and trained ridge regression models to predict the temporally-tuned MT-LSTM embeddings from EEG responses. Our analysis reveals that EEG signals can be used to predict MT-LSTM embeddings across various timescales. For longer timescales, our models produced accurate predictions within an extended time window of ±2 s around word onset, while for shorter timescales, significant predictions are confined to a narrow window ranging from −180 ms to 790 ms. Intriguingly, we observed that short timescale information is not only processed in the vicinity of word onset but also at distant time points. These observations underscore the parallels and discrepancies between computational models and the neural mechanisms of the brain. As word embeddings are used more as in silico models of semantic representation in the brain, a more explicit consideration of timescale-dependent processing enables more targeted explorations of language processing in humans and machines.
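The decoding setup lends itself to a compact sketch: a ridge regression model maps an EEG feature window around each word onset to the corresponding embedding, and predictions are scored by correlation on held-out words. The data below is synthetic, and the shapes and regularization strength are assumptions rather than the study's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, n_eeg_features, emb_dim = 2000, 64 * 20, 100   # e.g., 64 channels x 20 time samples (assumed)

eeg = rng.standard_normal((n_words, n_eeg_features))                    # EEG window per word (synthetic)
true_map = rng.standard_normal((n_eeg_features, emb_dim)) * 0.05
embeddings = eeg @ true_map + rng.standard_normal((n_words, emb_dim))   # stand-in for MT-LSTM embeddings

X_train, X_test, y_train, y_test = train_test_split(eeg, embeddings, test_size=0.2, random_state=0)
model = Ridge(alpha=100.0).fit(X_train, y_train)
pred = model.predict(X_test)

# Correlation between predicted and true values, averaged over embedding dimensions.
corrs = [np.corrcoef(pred[:, d], y_test[:, d])[0, 1] for d in range(emb_dim)]
print(f"mean decoding correlation: {np.mean(corrs):.3f}")
```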
{"title":"Exploring temporal sensitivity in the brain using multi-timescale language models: an EEG decoding study","authors":"Sijie Ling, Alex Murphy, Alona Fyshe","doi":"10.1162/coli_a_00533","DOIUrl":"https://doi.org/10.1162/coli_a_00533","url":null,"abstract":"The brain’s ability to perform complex computations at varying timescales is crucial, ranging from understanding single words to grasping the overarching narrative of a story. Recently, multi-timescale long short-term memory (MT-LSTM) models (Mahto et al. 2020; Jain et al. 2020) have been introduced, which use temporally-tuned parameters to induce sensitivity to different timescales of language processing (i.e. related to near/distant words). However, there has not been an exploration of the relation between such temporally-tuned information processing in MT-LSTMs and the brain’s language processing using high temporal resolution recording modalities, such as electroencephalography (EEG). To bridge this gap, we used an EEG dataset recorded while participants listened to Chapter 1 of “Alice in Wonderland” and trained ridge regression models to predict the temporally-tuned MT-LSTM embeddings from EEG responses. Our analysis reveals that EEG signals can be used to predict MT-LSTM embeddings across various timescales. For longer timescales, our models produced accurate predictions within an extended time window of ±2 s around word onset, while for shorter timescales, significant predictions are confined to a narrow window ranging from −180 ms to 790 ms. Intriguingly, we observed that short timescale information is not only processed in the vicinity of word onset but also at distant time points. These observations underscore the parallels and discrepancies between computational models and the neural mechanisms of the brain. As word embeddings are used more as in silico models of semantic representation in the brain, a more explicit consideration of timescale-dependent processing enables more targeted explorations of language processing in humans and machines.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"76 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human listeners effortlessly compensate for phonological changes during speech perception, often unconsciously inferring the intended sounds. For example, listeners infer the underlying /n/ when hearing an utterance such as “clea[m] pan”, where [m] arises from place assimilation to the following labial [p]. This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). Using psycholinguistic stimuli, we systematically analyze how various linguistic context cues influence compensation patterns in the model’s output. Complementing these behavioral experiments, our probing experiments indicate that the model shifts its interpretation of assimilated sounds from their acoustic form to their underlying form in its final layers. Finally, our causal intervention experiments suggest that the model relies on minimal phonological context cues to accomplish this shift. These findings represent a step towards better understanding the similarities and differences in phonological processing between neural ASR models and humans.
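As a rough sketch of the behavioral setup, the snippet below transcribes an audio stimulus with a pretrained Wav2Vec2 ASR model from Hugging Face and inspects whether the assimilated segment is restored in the output. The checkpoint and the stimulus file name are illustrative assumptions, not the paper's exact materials.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Hypothetical stimulus containing "clea[m] pan"; this checkpoint expects 16 kHz mono audio.
speech, sample_rate = sf.read("cleam_pan_stimulus.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))  # does the output contain "clean" or "cleam"?
```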
{"title":"Perception of Phonological Assimilation by Neural Speech Recognition Models","authors":"Charlotte Pouw, Marianne de Heer Kloots, Afra Alishahi, Willem Zuidema","doi":"10.1162/coli_a_00526","DOIUrl":"https://doi.org/10.1162/coli_a_00526","url":null,"abstract":"Human listeners effortlessly compensate for phonological changes during speech perception, often unconsciously inferring the intended sounds. For example, listeners infer the underlying /n/ when hearing an utterance such as “clea[m] pan”, where [m] arises from place assimilation to the following labial [p]. This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). Using psycholinguistic stimuli, we systematically analyze how various linguistic context cues influence compensation patterns in the model’s output. Complementing these behavioral experiments, our probing experiments indicate that the model shifts its interpretation of assimilated sounds from their acoustic form to their underlying form in its final layers. Finally, our causal intervention experiments suggest that the model relies on minimal phonological context cues to accomplish this shift. These findings represent a step towards better understanding the similarities and differences in phonological processing between neural ASR models and humans.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"41 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech learning encompasses mastering a complex motor system to produce speech sounds from articulatory gestures while simultaneously uncovering discrete units that provide entry to the linguistic system. Remarkably, children acquire these associations between speech sounds, articulatory gestures, and linguistic units in a weakly supervised manner, without the need for explicit labeling of auditory inputs or access to target articulatory gestures. This study uses self-supervised deep learning to investigate the respective roles of sounds, gestures, and linguistic units in speech acquisition and control. In a first experiment, we analysed the quantized representations learned by vector-quantized variational autoencoders (VQ-VAE) from ground truth acoustic and articulatory data using ABX tests. We show an interesting complementarity between acoustic and articulatory modalities that may help in the discovery of phonemes. In a second experiment, we introduce a computational agent that repeats auditory speech inputs by controlling a virtual vocal apparatus. This agent integrates an articulatory synthesizer capable of reproducing diverse speech stimuli from interpretable parameters, along with two internal models implementing the articulatory-to-acoustic (forward) and acoustic-to-articulatory (inverse) mapping, respectively. Additionally, two inductive biases are used to regularize the ill-posed acoustic-to-articulatory inverse mapping. In line with the first experiment, we explore the complementarity between the auditory input and the articulatory parameters inferred by the agent. We also evaluate the impact of discretizing auditory inputs using VQ-VAE. While the majority of the agent’s productions are intelligible (according to perceptual evaluations), our analysis highlights inconsistencies in the underlying articulatory trajectories. In particular, we show that the agent’s productions only partially reproduce the complementarity between the auditory and articulatory modalities observed in humans.
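To illustrate the kind of evaluation used in the first experiment, the sketch below implements a simple ABX test: a trial is correct when X (drawn from the same category as A) is closer to A than to B in representation space. Cosine distance over synthetic vectors stands in for the learned representations; the paper's ABX procedure over VQ-VAE codes is more involved.

```python
import numpy as np

def abx_accuracy(a_items, b_items, rng=np.random.default_rng(0), n_trials=1000):
    """a_items, b_items: arrays of shape (n, dim) holding representations of two categories."""
    def cos_dist(u, v):
        return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    correct = 0
    for _ in range(n_trials):
        a, x = a_items[rng.choice(len(a_items), 2, replace=False)]  # A and X from category 1
        b = b_items[rng.integers(len(b_items))]                      # B from category 2
        correct += cos_dist(a, x) < cos_dist(b, x)
    return correct / n_trials

# Toy example: two well-separated Gaussian clusters stand in for two phoneme categories.
rng = np.random.default_rng(0)
cat_a = rng.normal(loc=0.0, size=(50, 16))
cat_b = rng.normal(loc=2.0, size=(50, 16))
print(f"ABX accuracy: {abx_accuracy(cat_a, cat_b):.2f}")
```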
{"title":"Decode, move and speak! Self-supervised learning of speech units, gestures, and sounds relationships using vocal imitation","authors":"Marc-Antoine Georges, Marvin Lavechin, Jean-Luc Schwartz, Thomas Hueber","doi":"10.1162/coli_a_00532","DOIUrl":"https://doi.org/10.1162/coli_a_00532","url":null,"abstract":"Speech learning encompasses mastering a complex motor system to produce speech sounds from articulatory gestures while simultaneously uncovering discrete units that provide entry to the linguistic system. Remarkably, children acquire these associations between speech sounds, articulatory gestures, and linguistic units in a weakly supervised manner, without the need for explicit labeling of auditory inputs or access to target articulatory gestures. This study uses self-supervised deep learning to investigate the respective roles of sounds, gestures, and linguistic units in speech acquisition and control. In a first experiment, we analysed the quantized representations learned by vector-quantized variational autoencoders (VQ-VAE) from ground truth acoustic and articulatory data using ABX tests. We show an interesting complementarity between acoustic and articulatory modalities that may help in the discovery of phonemes. In a second experiment, we introduce a computational agent that repeats auditory speech inputs by controlling a virtual vocal apparatus. This agent integrates an articulatory synthesizer capable of reproducing diverse speech stimuli from interpretable parameters, along with two internal models implementing the articulatory-to-acoustic (forward) and acoustic-to-articulatory (inverse) mapping, respectively. Additionally, two inductive biases are used to regularize the ill-posed acoustic-to-articulatory inverse mapping. In line with the first experiment, we explore the complementarity between the auditory input and the articulatory parameters inferred by the agent. We also evaluate the impact of discretizing auditory inputs using VQ-VAE. While the majority of the agent’s productions are intelligible (according to perceptual evaluations), our analysis highlights inconsistencies in the underlying articulatory trajectories. In particular, we show that the agent’s productions only partially reproduce the complementarity between the auditory and articulatory modalities observed in humans.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"184 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How should we compare the capabilities of language models (LMs) and humans? In this paper, I draw inspiration from comparative psychology to highlight challenges in these comparisons. I focus on a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot process these structures as reliably as humans can. However, the humans were provided with instructions and substantial training, while the LMs were evaluated zero-shot. I therefore match the evaluation more closely. Providing large LMs with a simple prompt—with substantially less content than the human training—allows the LMs to consistently outperform the human results, even in more deeply nested conditions than were tested with humans. Furthermore, the effects of prompting are robust to the particular structures and vocabulary used in the prompt. Finally, reanalyzing the existing human data suggests that the humans may not perform above chance at the difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans, when evaluated comparably. This case study highlights how discrepancies in the evaluation methods can confound comparisons of language models and humans. I conclude by reflecting on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and foundation models.
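For concreteness, the following is a small sketch of how recursively center-embedded sentences of the kind discussed here can be generated at increasing depth. The word lists are illustrative; the paper's actual stimuli and prompts are not reproduced.

```python
# Build center-embedded sentences such as "the dog the cat chased barked".
nouns = ["dog", "cat", "bird", "mouse"]
verbs = ["chased", "saw", "bit", "followed"]

def center_embedded(depth: int) -> str:
    """Return a sentence with `depth` levels of center embedding."""
    ns, vs = nouns[:depth + 1], verbs[:depth]
    subject_chain = " ".join(f"the {n}" for n in ns)   # the dog the cat the bird ...
    verb_chain = " ".join(reversed(vs))                # innermost verb comes first
    return f"{subject_chain} {verb_chain} barked"

for d in range(1, 4):
    print(d, "->", center_embedded(d))
```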
{"title":"Can language models handle recursively nested grammatical structures? A case study on comparing models and humans","authors":"Andrew Lampinen","doi":"10.1162/coli_a_00525","DOIUrl":"https://doi.org/10.1162/coli_a_00525","url":null,"abstract":"How should we compare the capabilities of language models (LMs) and humans? In this paper, I draw inspiration from comparative psychology to highlight challenges in these comparisons. I focus on a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot process these structures as reliably as humans can. However, the humans were provided with instructions and substantial training, while the LMs were evaluated zero-shot. I therefore match the evaluation more closely. Providing large LMs with a simple prompt—with substantially less content than the human training—allows the LMs to consistently outperform the human results, even in more deeply nested conditions than were tested with humans. Furthermore, the effects of prompting are robust to the particular structures and vocabulary used in the prompt. Finally, reanalyzing the existing human data suggests that the humans may not perform above chance at the difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans, when evaluated comparably. This case study highlights how discrepancies in the evaluation methods can confound comparisons of language models and humans. I conclude by reflecting on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and foundation models.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"74 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pseudowords such as “knackets” or “spechy” – letter strings that are consistent with the orthotactical rules of a language but do not appear in its lexicon – are traditionally considered to be meaningless, and employed as such in empirical studies. However, recent studies that show specific semantic patterns associated with these words as well as semantic effects on human pseudoword processing have cast doubt on this view. While these studies suggest that pseudowords have meanings, they provide only extremely limited insight as to whether humans are able to ascribe explicit and declarative semantic content to unfamiliar word forms. In the present study, we employed an exploratory-confirmatory study design to examine this question. In a first exploratory study, we started from a pre-existing dataset of words and pseudowords alongside human-generated definitions for these items. Employing 18 different language models, we showed that the definitions actually produced for (pseudo)words were closer to their respective (pseudo)words than the definitions for the other items. Based on these initial results, we conducted a second, pre-registered, high-powered confirmatory study collecting a new, controlled set of (pseudo)word interpretations. This second study confirmed the results of the first one. Taken together, these findings support the idea that meaning construction is supported by a flexible form-to-meaning mapping system based on statistical regularities in the language environment that can accommodate novel lexical entries as soon as they are encountered.
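The matching logic behind the first analysis can be sketched compactly: embed each (pseudo)word form and each definition, then test whether an item's own definition is closer to it than the definitions written for other items. The sentence-encoder checkpoint and the two toy items below (including their made-up definitions) are illustrative assumptions, not the study's 18 language models or its materials.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

items = ["knackets", "spechy"]                      # pseudoword forms
definitions = {                                     # toy, made-up definitions for illustration
    "knackets": "small wooden pegs used to fasten things together",
    "spechy": "having a slightly rough or speckled surface",
}

item_emb = model.encode(items, convert_to_tensor=True)
def_emb = model.encode([definitions[w] for w in items], convert_to_tensor=True)
sims = util.cos_sim(item_emb, def_emb)              # items x definitions similarity matrix

for i, w in enumerate(items):
    own = sims[i, i].item()
    other = max(sims[i, j].item() for j in range(len(items)) if j != i)
    print(f"{w}: own-definition sim = {own:.3f}, best other-definition sim = {other:.3f}")
```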
{"title":"Meaning beyond lexicality: Capturing Pseudoword Definitions with Language Models","authors":"Andrea Gregor de Varda, Daniele Gatti, Marco Marelli, Fritz Günther","doi":"10.1162/coli_a_00527","DOIUrl":"https://doi.org/10.1162/coli_a_00527","url":null,"abstract":"Pseudowords such as “knackets” or “spechy” – letter strings that are consistent with the orthotactical rules of a language but do not appear in its lexicon – are traditionally considered to be meaningless, and employed as such in empirical studies. However, recent studies that show specific semantic patterns associated with these words as well as semantic effects on human pseudoword processing have cast doubt on this view. While these studies suggest that pseudowords have meanings, they provide only extremely limited insight as to whether humans are able to ascribe explicit and declarative semantic content to unfamiliar word forms. In the present study, we employed an exploratory-confirmatory study design to examine this question. In a first exploratory study, we started from a pre-existing dataset of words and pseudowords alongside human-generated definitions for these items. Employing 18 different language models, we showed that the definitions actually produced for (pseudo)words were closer to their respective (pseudo)words than the definitions for the other items. Based on these initial results, we conducted a second, pre-registered, high-powered confirmatory study collecting a new, controlled set of (pseudo)word interpretations. This second study confirmed the results of the first one. Taken together, these findings support the idea that meaning construction is supported by a flexible form-to-meaning mapping system based on statistical regularities in the language environment that can accommodate novel lexical entries as soon as they are encountered.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"55 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}