Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities—and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through “embodied simulation,” the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM’s lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture—despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
{"title":"Do Multimodal Large Language Models and Humans Ground Language Similarly?","authors":"Cameron Jones, Benjamin Bergen, Sean Trott","doi":"10.1162/coli_a_00531","DOIUrl":"https://doi.org/10.1162/coli_a_00531","url":null,"abstract":"Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities—and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through “embodied simulation,” the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM’s lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture—despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"28 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We (1) build the first CLCTS corpus with 328 (+127) instances for hDe-En and 289 (+212) for hEn-De, leveraging historical fiction texts and Wikipedia summaries in English and German; (2) examine the effectiveness of popular transformer end-to-end models with different intermediate fine-tuning tasks; (3) explore the potential of GPT-3.5 as a summarizer; (4) report evaluations from humans, GPT-4, and several recent automatic evaluation metrics. Our results indicate that intermediate-task fine-tuned end-to-end models generate poor to moderate quality summaries, while GPT-3.5, as a zero-shot summarizer, provides moderate to good quality outputs. GPT-3.5 also seems very adept at normalizing historical text. To assess data contamination in GPT-3.5, we design an adversarial attack scheme and find that GPT-3.5 performs slightly worse for unseen source documents compared to seen documents. Moreover, it sometimes hallucinates when the source sentences are altered to contradict its prior knowledge, with summarization accuracies of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot negation. Overall, our regression analysis of model performance suggests that longer, older, and more complex source texts (all of which are more characteristic of historical language variants) are harder for all models to summarize, indicating the difficulty of the CLCTS task. Regarding evaluation, we observe that both GPT-4 and BERTScore correlate moderately with human evaluations, but GPT-4 is prone to giving lower scores.
{"title":"Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation","authors":"Ran Zhang, Jihed Ouni, Steffen Eger","doi":"10.1162/coli_a_00519","DOIUrl":"https://doi.org/10.1162/coli_a_00519","url":null,"abstract":"While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We (1) build the first CLCTS corpus with 328 (+127) instances for hDe-En and 289 (+212) for hEn-De, leveraging historical fiction texts and Wikipedia summaries in English and German; (2) examine the effectiveness of popular transformer end-to-end models with different intermediate finetuning tasks; (3) explore the potential of GPT-3.5 as a summarizer; (4) report evaluations from humans, GPT-4, and several recent automatic evaluation metrics. Our results indicate that intermediate task finetuned end-to-end models generate bad to moderate quality summaries while GPT-3.5, as a zero-shot summarizer, provides moderate to good quality outputs. GPT-3.5 also seems very adept at normalizing historical text. To assess data contamination in GPT-3.5, we design an adversarial attack scheme in which we find that GPT-3.5 performs slightly worse for unseen source documents compared to seen documents. Moreover, it sometimes hallucinates when the source sentences are inverted against its prior knowledge with a summarization accuracy of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot negation. Overall, our regression results of model performances suggest that longer, older, and more complex source texts (all of which are more characteristic for historical language variants) are harder to summarize for all models, indicating the difficulty of the CLCTS task. Regarding evaluation, we observe that both GPT-4 and BERTScore correlate moderately with human evaluations but GPT-4 is prone to giving lower scores.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"44 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141059917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated coherence metrics constitute an efficient and popular way to evaluate topic models. Previous works present a mixed picture of their presumed correlation with human judgment. This work proposes a novel sampling approach to mine topic representations at large scale while seeking to mitigate bias from sampling, enabling the investigation of widely used automated coherence metrics via large corpora. Additionally, this article proposes a novel user study design, an amalgamation of different proxy tasks, to gain finer-grained insight into human decision-making processes. This design subsumes the purpose of simple rating and outlier-detection user studies. Like the sampling approach, the user study is extensive, comprising forty participants split into eight study groups, each tasked with evaluating its own set of one hundred topic representations. Usually, when substantiating the use of these metrics, human responses are treated as the gold standard. This article further investigates the reliability of human judgment by flipping the comparison and conducting a novel extended analysis of human responses at the group and individual levels against a generic corpus. The results show a moderate to good correlation between these metrics and human judgment, especially for generic corpora, and yield further insights into the human perception of coherence. Analysing inter-metric correlations across corpora shows moderate to good correlation amongst these metrics. As these metrics depend on corpus statistics, this article further investigates the topical differences between corpora, revealing nuances in the application of these metrics.
{"title":"Aligning Human and Computational Coherence Evaluations","authors":"Jia Peng Lim, Hady W. Lauw","doi":"10.1162/coli_a_00518","DOIUrl":"https://doi.org/10.1162/coli_a_00518","url":null,"abstract":"Automated coherence metrics constitute an efficient and popular way to evaluate topic models. Previous works present a mixed picture of their presumed correlation with human judgment. This work proposes a novel sampling approach to mine topic representations at a large-scale while seeking to mitigate bias from sampling, enabling the investigation of widely-used automated coherence metrics via large corpora. Additionally, this article proposes a novel user study design, an amalgamation of different proxy tasks, to derive a finer insight into the human decision-making processes. This design subsumes the purpose of simple rating and outlier-detection user studies. Similar to the sampling approach, the user study conducted is very extensive, comprising forty study participants split into eight different study groups tasked with evaluating their respective set of one hundred topic representations. Usually, when substantiating the use of these metrics, human responses are treated as the golden standard. This article further investigates the reliability of human judgment by flipping the comparison and conducting a novel extended analysis of human response at the group and individual level against a generic corpus. The investigation results show a moderate to good correlation between these metrics and human judgment, especially for generic corpora, and derive further insights into the human perception of coherence. Analysing inter-metric correlations across corpora shows moderate to good correlation amongst these metrics. As these metrics depend on corpus statistics, this article further investigates the topical differences between corpora revealing nuances in applications of these metrics.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"64 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140838018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jan-Christoph Klie, Richard Eckart de Castilho, Iryna Gurevych
Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models as well as for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts. While practices and guidelines regarding dataset creation projects exist, to our knowledge, no large-scale analysis has yet been performed of how quality management is conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions for applying them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. A majority of the annotated publications apply good or excellent quality management. However, we deem the effort of 30% of the works only subpar. Our analysis also reveals common errors, especially in the use of inter-annotator agreement and the computation of annotation error rates.
{"title":"Analyzing Dataset Annotation Quality Management in the Wild","authors":"Jan-Christoph Klie, Richard Eckart de Castilho, Iryna Gurevych","doi":"10.1162/coli_a_00516","DOIUrl":"https://doi.org/10.1162/coli_a_00516","url":null,"abstract":"Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models as well as for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts. While practices and guidelines regarding dataset creation projects exist, to our knowledge, large-scale analysis has yet to be performed on how quality management is conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions for applying them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. A majority of the annotated publications apply good or excellent quality management. However, we deem the effort of 30% of the works as only subpar. Our analysis also shows common errors, especially when using inter-annotator agreement and computing annotation error rates.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"17 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140298308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meishan Zhang, Gongyao Jiang, Shuang Liu, Jing Chen, Min Zhang
Dialogue-level dependency parsing, despite growing academic interest, often suffers from underperformance due to resource shortages. A potential solution to this challenge is data augmentation. In recent years, large language models (LLMs) have demonstrated strong generative capabilities that can greatly facilitate data augmentation. In this study, we focus on Chinese dialogue-level dependency parsing, presenting three simple and effective LLM-based strategies for augmenting the original training instances: word-level, syntax-level, and discourse-level augmentation. These strategies enable LLMs to either preserve or modify dependency structures, thereby ensuring accuracy while increasing the diversity of instances at different levels. We conduct experiments on the benchmark dataset released by Jiang et al. (2023) to validate our approach. Results show that our method can greatly boost parsing performance in various settings, particularly for dependencies among elementary discourse units (EDUs). Lastly, we provide an in-depth analysis highlighting the key aspects of our data augmentation strategies.
{"title":"LLM–Assisted Data Augmentation for Chinese Dialogue–Level Dependency Parsing","authors":"Meishan Zhang, Gongyao Jiang, Shuang Liu, Jing Chen, Min Zhang","doi":"10.1162/coli_a_00515","DOIUrl":"https://doi.org/10.1162/coli_a_00515","url":null,"abstract":"Dialogue–level dependency parsing, despite its growing academic interest, often encounters underperformance issues due to resource shortages. A potential solution to this challenge is data augmentation. In recent years, large language models (LLMs) have demonstrated strong capabilities in generation which can facilitate data augmentation greatly. In this study, we focus on Chinese dialogue–level dependency parsing, presenting three simple and effective strategies with LLM to augment the original training instances, namely word–level, syntax–level and discourse–level augmentations, respectively. These strategies enable LLMs to either preserve or modify dependency structures, thereby assuring accuracy while increasing the diversity of instances at different levels. We conduct experiments on the benchmark dataset released by Jiang et al. (2023) to validate our approach. Results show that our method can greatly boost the parsing performance in various settings, particularly in dependencies among elementary discourse units (EDUs). Lastly, we provide in–depth analysis to show the key points of our data augmentation strategies.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"72 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140127870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maja Braović, Damir Krstinić, Maja Štula, Antonia Ivanda
This paper provides detailed insight into computational approaches for deciphering Bronze Age Aegean and Cypriot scripts, namely the Archanes script and the Archanes formula, the Phaistos Disk, Cretan hieroglyphic (including the Malia Altar Stone and the Arkalochori Axe), Linear A, Linear B, Cypro-Minoan, and the Cypriot scripts. The unique contributions of this paper are threefold: 1) a thorough review of major Bronze Age Aegean and Cypriot scripts and inscriptions, the digital data and corpora associated with them, existing computational methods developed to decipher them, and possible links to other scripts and languages; 2) the definition of 15 major challenges that can be encountered in computational decipherments of ancient scripts; and 3) an outline of a computational model that could be used to simulate traditional decipherment processes of ancient scripts based on palaeography and epigraphy. In the context of this paper, the term decipherment denotes the process of discovering the language and/or the set of symbols behind an unknown script, and the meaning behind it.
{"title":"A Systematic Review of Computational Approaches to Deciphering Bronze Age Aegean and Cypriot Scripts","authors":"Maja Braović, Damir Krstinić, Maja Štula, Antonia Ivanda","doi":"10.1162/coli_a_00514","DOIUrl":"https://doi.org/10.1162/coli_a_00514","url":null,"abstract":"This paper provides a detailed insight into computational approaches for deciphering Bronze Age Aegean and Cypriot scripts, namely the Archanes script and the Archanes formula, Phaistos Disk, Cretan hieroglyphic (including the Malia Altar Stone and Arkalochori Axe), Linear A, Linear B, Cypro-Minoan and Cypriot scripts. The unique contributions of this paper are threefold: 1) a thorough review of major Bronze Age Aegean and Cypriot scripts and inscriptions, digital data and corpora associated with them, existing computational decipherment methods developed in order to decipher them, and possible links to other scripts and languages; 2) the definition of 15 major challenges that can be encountered in computational decipherments of ancient scripts; and 3) an outline of a computational model that could possibly be used to simulate traditional decipherment processes of ancient scripts based on palaeography and epigraphy. In the context of this paper the term decipherment denotes the process of discovery of the language and/or the set of symbols behind an unknown script, and the meaning behind it.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"25 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140074695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eunkyul Leah Jo, Angela Yoonseo Park, Jungyeul Park
We propose a novel method for calculating PARSEVAL measures to evaluate constituent parsing results. Previous constituent parsing evaluation techniques were constrained by the requirement for consistent sentence boundaries and tokenization results, a requirement that proved both stringent and inconvenient. Our new approach handles constituent parsing results obtained from raw text, even when sentence boundaries and tokenization differ from those of the preprocessed gold sentences. We implement this measure through an evaluation-by-alignment approach. The algorithm enables the alignment of tokens and sentences in the gold and system parse trees. Our proposed algorithm draws on the analogy of sentence and word alignment commonly employed in machine translation (MT). To demonstrate the intricacies of the calculations and clarify how the configurations fit together, we explain the implementation in detailed pseudo-code and provide empirical evidence that sentence and word alignment can improve evaluation reliability.
{"title":"A Novel Alignment-based Approach for PARSEVAL Measures","authors":"Eunkyul Leah Jo, Angela Yoonseo Park, Jungyeul Park","doi":"10.1162/coli_a_00512","DOIUrl":"https://doi.org/10.1162/coli_a_00512","url":null,"abstract":"We propose a novel method for calculating PARSEVAL measures to evaluate constituent parsing results. Previous constituent parsing evaluation techniques were constrained by the requirement for consistent sentence boundaries and tokenization results, proving to be stringent and inconvenient. Our new approach handles constituent parsing results obtained from raw text, even when sentence boundaries and tokenization differ from the preprocessed gold sentence. Implementing this measure is our evaluation by alignment approach. The algorithm enables the alignment of tokens and sentences in the gold and system parse trees. Our proposed algorithm draws on the analogy of sentence and word alignment commonly employed in machine translation (MT). To demonstrate the intricacy of calculations and clarify any integration of configurations, we explain the implementations in detailed pseudo-code and provide empirical proof for how sentence and word alignment can improve evaluation reliability.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"65 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140035802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qing Lyu, Marianna Apidianaki, Chris Callison-Burch
End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately represent the reasoning process behind the model’s prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP.
{"title":"Towards Faithful Model Explanation in NLP: A Survey","authors":"Qing Lyu, Marianna Apidianaki, Chris Callison-Burch","doi":"10.1162/coli_a_00511","DOIUrl":"https://doi.org/10.1162/coli_a_00511","url":null,"abstract":"End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately represent the reasoning process behind the model’s prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"26 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139562465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christo Kirov, Cibu Johny, Anna Katanova, Alexander Gutkin, Brian Roark
While most transliteration research is focused on single tokens such as named entities (e.g., transliteration of “અમદાવાદ” from the Gujarati script to the Latin script as “Ahmedabad”), the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full-sentence (as opposed to single-word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models fine-tuned on simulated parallel data, yield substantial improvements over the best previously reported results for full-sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 3.3% absolute (18.6% relative) mean word-error rate reduction.
{"title":"Context-aware Transliteration of Romanized South Asian Languages","authors":"Christo Kirov, Cibu Johny, Anna Katanova, Alexander Gutkin, Brian Roark","doi":"10.1162/coli_a_00510","DOIUrl":"https://doi.org/10.1162/coli_a_00510","url":null,"abstract":"While most transliteration research is focused on single tokens such as named entities – e.g., transliteration of “અમદાવાદ” from the Gujarati script to the Latin script “Ahmedabad” – the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full sentence (as opposed to single word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as via mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models finetuned on simulated parallel data, yield substantial improvements over the best previously reported results for full sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 3.3% absolute (18.6% relative) mean word-error rate reduction.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"31 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite impressive advances in Natural Language Generation (NLG) and Large Language Models (LLMs), researchers are still unclear about important aspects of NLG evaluation. To substantiate this claim, I examine current classifications of hallucination and omission in data-to-text NLG, and I propose a logic-based synthesis of these classifications. I conclude by highlighting some remaining limitations of all current thinking about hallucination and by discussing implications for LLMs.
{"title":"The Pitfalls of Defining Hallucination","authors":"Kees van Deemter","doi":"10.1162/coli_a_00509","DOIUrl":"https://doi.org/10.1162/coli_a_00509","url":null,"abstract":"Despite impressive advances in Natural Language Generation (NLG) and Large Language Models (LLMs), researchers are still unclear about important aspects of NLG evaluation. To substantiate this claim, I examine current classifications of hallucination and omission in Datatext NLG, and I propose a logic-based synthesis of these classfications. I conclude by highlighting some remaining limitations of all current thinking about hallucination and by discussing implications for LLMs.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"31 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}