Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities—and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through “embodied simulation,” the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM’s lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture—despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
{"title":"Do Multimodal Large Language Models and Humans Ground Language Similarly?","authors":"Cameron Jones, Benjamin Bergen, Sean Trott","doi":"10.1162/coli_a_00531","DOIUrl":"https://doi.org/10.1162/coli_a_00531","url":null,"abstract":"Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities—and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through “embodied simulation,” the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM’s lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture—despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"28 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We (1) build the first CLCTS corpus with 328 (+127) instances for hDe-En and 289 (+212) for hEn-De, leveraging historical fiction texts and Wikipedia summaries in English and German; (2) examine the effectiveness of popular transformer end-to-end models with different intermediate fine-tuning tasks; (3) explore the potential of GPT-3.5 as a summarizer; (4) report evaluations from humans, GPT-4, and several recent automatic evaluation metrics. Our results indicate that intermediate-task fine-tuned end-to-end models generate poor to moderate quality summaries, while GPT-3.5, as a zero-shot summarizer, provides moderate to good quality outputs. GPT-3.5 also seems very adept at normalizing historical text. To assess data contamination in GPT-3.5, we design an adversarial attack scheme and find that GPT-3.5 performs slightly worse for unseen source documents compared to seen documents. Moreover, it sometimes hallucinates when the source sentences are altered to contradict its prior knowledge, with summarization accuracies of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot negation. Overall, our regression analysis of model performance suggests that longer, older, and more complex source texts (all of which are more characteristic of historical language variants) are harder for all models to summarize, indicating the difficulty of the CLCTS task. Regarding evaluation, we observe that both GPT-4 and BERTScore correlate moderately with human evaluations, but GPT-4 is prone to giving lower scores.
{"title":"Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation","authors":"Ran Zhang, Jihed Ouni, Steffen Eger","doi":"10.1162/coli_a_00519","DOIUrl":"https://doi.org/10.1162/coli_a_00519","url":null,"abstract":"While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We (1) build the first CLCTS corpus with 328 (+127) instances for hDe-En and 289 (+212) for hEn-De, leveraging historical fiction texts and Wikipedia summaries in English and German; (2) examine the effectiveness of popular transformer end-to-end models with different intermediate finetuning tasks; (3) explore the potential of GPT-3.5 as a summarizer; (4) report evaluations from humans, GPT-4, and several recent automatic evaluation metrics. Our results indicate that intermediate task finetuned end-to-end models generate bad to moderate quality summaries while GPT-3.5, as a zero-shot summarizer, provides moderate to good quality outputs. GPT-3.5 also seems very adept at normalizing historical text. To assess data contamination in GPT-3.5, we design an adversarial attack scheme in which we find that GPT-3.5 performs slightly worse for unseen source documents compared to seen documents. Moreover, it sometimes hallucinates when the source sentences are inverted against its prior knowledge with a summarization accuracy of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot negation. Overall, our regression results of model performances suggest that longer, older, and more complex source texts (all of which are more characteristic for historical language variants) are harder to summarize for all models, indicating the difficulty of the CLCTS task. Regarding evaluation, we observe that both GPT-4 and BERTScore correlate moderately with human evaluations but GPT-4 is prone to giving lower scores.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"44 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141059917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated coherence metrics constitute an efficient and popular way to evaluate topic models. Previous works present a mixed picture of their presumed correlation with human judgment. This work proposes a novel sampling approach to mine topic representations at large scale while seeking to mitigate bias from sampling, enabling the investigation of widely used automated coherence metrics via large corpora. Additionally, this article proposes a novel user study design, an amalgamation of different proxy tasks, to gain finer-grained insight into human decision-making processes. This design subsumes the purpose of simple rating and outlier-detection user studies. Like the sampling approach, the user study is extensive, comprising forty participants split into eight study groups, each tasked with evaluating its own set of one hundred topic representations. Usually, when substantiating the use of these metrics, human responses are treated as the gold standard. This article further investigates the reliability of human judgment by flipping the comparison and conducting a novel extended analysis of human responses at the group and individual levels against a generic corpus. The results show a moderate to good correlation between these metrics and human judgment, especially for generic corpora, and yield further insights into the human perception of coherence. Analysing inter-metric correlations across corpora shows moderate to good correlation amongst these metrics. As these metrics depend on corpus statistics, this article further investigates the topical differences between corpora, revealing nuances in the application of these metrics.
{"title":"Aligning Human and Computational Coherence Evaluations","authors":"Jia Peng Lim, Hady W. Lauw","doi":"10.1162/coli_a_00518","DOIUrl":"https://doi.org/10.1162/coli_a_00518","url":null,"abstract":"Automated coherence metrics constitute an efficient and popular way to evaluate topic models. Previous works present a mixed picture of their presumed correlation with human judgment. This work proposes a novel sampling approach to mine topic representations at a large-scale while seeking to mitigate bias from sampling, enabling the investigation of widely-used automated coherence metrics via large corpora. Additionally, this article proposes a novel user study design, an amalgamation of different proxy tasks, to derive a finer insight into the human decision-making processes. This design subsumes the purpose of simple rating and outlier-detection user studies. Similar to the sampling approach, the user study conducted is very extensive, comprising forty study participants split into eight different study groups tasked with evaluating their respective set of one hundred topic representations. Usually, when substantiating the use of these metrics, human responses are treated as the golden standard. This article further investigates the reliability of human judgment by flipping the comparison and conducting a novel extended analysis of human response at the group and individual level against a generic corpus. The investigation results show a moderate to good correlation between these metrics and human judgment, especially for generic corpora, and derive further insights into the human perception of coherence. Analysing inter-metric correlations across corpora shows moderate to good correlation amongst these metrics. As these metrics depend on corpus statistics, this article further investigates the topical differences between corpora revealing nuances in applications of these metrics.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"64 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140838018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jan-Christoph Klie, Richard Eckart de Castilho, Iryna Gurevych
Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models as well as for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts. While practices and guidelines regarding dataset creation projects exist, to our knowledge, no large-scale analysis has yet been performed of how quality management is conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions for applying them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. A majority of the annotated publications apply good or excellent quality management. However, we deem the effort of 30% of the works only subpar. Our analysis also reveals common errors, especially in the use of inter-annotator agreement and the computation of annotation error rates.
{"title":"Analyzing Dataset Annotation Quality Management in the Wild","authors":"Jan-Christoph Klie, Richard Eckart de Castilho, Iryna Gurevych","doi":"10.1162/coli_a_00516","DOIUrl":"https://doi.org/10.1162/coli_a_00516","url":null,"abstract":"Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models as well as for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts. While practices and guidelines regarding dataset creation projects exist, to our knowledge, large-scale analysis has yet to be performed on how quality management is conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions for applying them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. A majority of the annotated publications apply good or excellent quality management. However, we deem the effort of 30% of the works as only subpar. Our analysis also shows common errors, especially when using inter-annotator agreement and computing annotation error rates.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"17 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140298308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meishan Zhang, Gongyao Jiang, Shuang Liu, Jing Chen, Min Zhang
Dialogue-level dependency parsing, despite growing academic interest, often suffers from underperformance due to resource shortages. A potential solution to this challenge is data augmentation. In recent years, large language models (LLMs) have demonstrated strong generative capabilities that can greatly facilitate data augmentation. In this study, we focus on Chinese dialogue-level dependency parsing, presenting three simple and effective LLM-based strategies for augmenting the original training instances: word-level, syntax-level, and discourse-level augmentation. These strategies enable LLMs to either preserve or modify dependency structures, thereby ensuring accuracy while increasing the diversity of instances at different levels. We conduct experiments on the benchmark dataset released by Jiang et al. (2023) to validate our approach. Results show that our method can greatly boost parsing performance in various settings, particularly for dependencies among elementary discourse units (EDUs). Lastly, we provide an in-depth analysis highlighting the key aspects of our data augmentation strategies.
{"title":"LLM–Assisted Data Augmentation for Chinese Dialogue–Level Dependency Parsing","authors":"Meishan Zhang, Gongyao Jiang, Shuang Liu, Jing Chen, Min Zhang","doi":"10.1162/coli_a_00515","DOIUrl":"https://doi.org/10.1162/coli_a_00515","url":null,"abstract":"Dialogue–level dependency parsing, despite its growing academic interest, often encounters underperformance issues due to resource shortages. A potential solution to this challenge is data augmentation. In recent years, large language models (LLMs) have demonstrated strong capabilities in generation which can facilitate data augmentation greatly. In this study, we focus on Chinese dialogue–level dependency parsing, presenting three simple and effective strategies with LLM to augment the original training instances, namely word–level, syntax–level and discourse–level augmentations, respectively. These strategies enable LLMs to either preserve or modify dependency structures, thereby assuring accuracy while increasing the diversity of instances at different levels. We conduct experiments on the benchmark dataset released by Jiang et al. (2023) to validate our approach. Results show that our method can greatly boost the parsing performance in various settings, particularly in dependencies among elementary discourse units (EDUs). Lastly, we provide in–depth analysis to show the key points of our data augmentation strategies.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"72 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140127870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maja Braović, Damir Krstinić, Maja Štula, Antonia Ivanda
This paper provides detailed insight into computational approaches for deciphering Bronze Age Aegean and Cypriot scripts, namely the Archanes script and the Archanes formula, the Phaistos Disk, Cretan hieroglyphic (including the Malia Altar Stone and the Arkalochori Axe), Linear A, Linear B, Cypro-Minoan, and the Cypriot scripts. The unique contributions of this paper are threefold: 1) a thorough review of major Bronze Age Aegean and Cypriot scripts and inscriptions, the digital data and corpora associated with them, existing computational methods developed to decipher them, and possible links to other scripts and languages; 2) the definition of 15 major challenges that can be encountered in computational decipherments of ancient scripts; and 3) an outline of a computational model that could be used to simulate traditional decipherment processes of ancient scripts based on palaeography and epigraphy. In the context of this paper, the term decipherment denotes the process of discovering the language and/or the set of symbols behind an unknown script, and the meaning behind it.
{"title":"A Systematic Review of Computational Approaches to Deciphering Bronze Age Aegean and Cypriot Scripts","authors":"Maja Braović, Damir Krstinić, Maja Štula, Antonia Ivanda","doi":"10.1162/coli_a_00514","DOIUrl":"https://doi.org/10.1162/coli_a_00514","url":null,"abstract":"This paper provides a detailed insight into computational approaches for deciphering Bronze Age Aegean and Cypriot scripts, namely the Archanes script and the Archanes formula, Phaistos Disk, Cretan hieroglyphic (including the Malia Altar Stone and Arkalochori Axe), Linear A, Linear B, Cypro-Minoan and Cypriot scripts. The unique contributions of this paper are threefold: 1) a thorough review of major Bronze Age Aegean and Cypriot scripts and inscriptions, digital data and corpora associated with them, existing computational decipherment methods developed in order to decipher them, and possible links to other scripts and languages; 2) the definition of 15 major challenges that can be encountered in computational decipherments of ancient scripts; and 3) an outline of a computational model that could possibly be used to simulate traditional decipherment processes of ancient scripts based on palaeography and epigraphy. In the context of this paper the term decipherment denotes the process of discovery of the language and/or the set of symbols behind an unknown script, and the meaning behind it.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"25 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140074695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eunkyul Leah Jo, Angela Yoonseo Park, Jungyeul Park
We propose a novel method for calculating PARSEVAL measures to evaluate constituent parsing results. Previous constituent parsing evaluation techniques were constrained by the requirement for consistent sentence boundaries and tokenization results, a requirement that proved both stringent and inconvenient. Our new approach handles constituent parsing results obtained from raw text, even when sentence boundaries and tokenization differ from those of the preprocessed gold sentences. We implement this measure through an evaluation-by-alignment approach. The algorithm enables the alignment of tokens and sentences in the gold and system parse trees. Our proposed algorithm draws on the analogy of sentence and word alignment commonly employed in machine translation (MT). To demonstrate the intricacies of the calculations and clarify how the configurations fit together, we explain the implementation in detailed pseudo-code and provide empirical evidence that sentence and word alignment can improve evaluation reliability.
{"title":"A Novel Alignment-based Approach for PARSEVAL Measures","authors":"Eunkyul Leah Jo, Angela Yoonseo Park, Jungyeul Park","doi":"10.1162/coli_a_00512","DOIUrl":"https://doi.org/10.1162/coli_a_00512","url":null,"abstract":"We propose a novel method for calculating PARSEVAL measures to evaluate constituent parsing results. Previous constituent parsing evaluation techniques were constrained by the requirement for consistent sentence boundaries and tokenization results, proving to be stringent and inconvenient. Our new approach handles constituent parsing results obtained from raw text, even when sentence boundaries and tokenization differ from the preprocessed gold sentence. Implementing this measure is our evaluation by alignment approach. The algorithm enables the alignment of tokens and sentences in the gold and system parse trees. Our proposed algorithm draws on the analogy of sentence and word alignment commonly employed in machine translation (MT). To demonstrate the intricacy of calculations and clarify any integration of configurations, we explain the implementations in detailed pseudo-code and provide empirical proof for how sentence and word alignment can improve evaluation reliability.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"65 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140035802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qing Lyu, Marianna Apidianaki, Chris Callison-Burch
End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately represent the reasoning process behind the model’s prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP.
{"title":"Towards Faithful Model Explanation in NLP: A Survey","authors":"Qing Lyu, Marianna Apidianaki, Chris Callison-Burch","doi":"10.1162/coli_a_00511","DOIUrl":"https://doi.org/10.1162/coli_a_00511","url":null,"abstract":"End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately represent the reasoning process behind the model’s prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"26 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139562465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christo Kirov, Cibu Johny, Anna Katanova, Alexander Gutkin, Brian Roark
While most transliteration research is focused on single tokens such as named entities (e.g., transliteration of “અમદાવાદ” from the Gujarati script to the Latin script as “Ahmedabad”), the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full-sentence (as opposed to single-word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models fine-tuned on simulated parallel data, yield substantial improvements over the best previously reported results for full-sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 3.3% absolute (18.6% relative) mean word-error rate reduction.
{"title":"Context-aware Transliteration of Romanized South Asian Languages","authors":"Christo Kirov, Cibu Johny, Anna Katanova, Alexander Gutkin, Brian Roark","doi":"10.1162/coli_a_00510","DOIUrl":"https://doi.org/10.1162/coli_a_00510","url":null,"abstract":"While most transliteration research is focused on single tokens such as named entities – e.g., transliteration of “અમદાવાદ” from the Gujarati script to the Latin script “Ahmedabad” – the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full sentence (as opposed to single word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as via mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models finetuned on simulated parallel data, yield substantial improvements over the best previously reported results for full sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 3.3% absolute (18.6% relative) mean word-error rate reduction.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"31 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite impressive advances in Natural Language Generation (NLG) and Large Language Models (LLMs), researchers are still unclear about important aspects of NLG evaluation. To substantiate this claim, I examine current classifications of hallucination and omission in data-to-text NLG, and I propose a logic-based synthesis of these classifications. I conclude by highlighting some remaining limitations of all current thinking about hallucination and by discussing implications for LLMs.
{"title":"The Pitfalls of Defining Hallucination","authors":"Kees van Deemter","doi":"10.1162/coli_a_00509","DOIUrl":"https://doi.org/10.1162/coli_a_00509","url":null,"abstract":"Despite impressive advances in Natural Language Generation (NLG) and Large Language Models (LLMs), researchers are still unclear about important aspects of NLG evaluation. To substantiate this claim, I examine current classifications of hallucination and omission in Datatext NLG, and I propose a logic-based synthesis of these classfications. I conclude by highlighting some remaining limitations of all current thinking about hallucination and by discussing implications for LLMs.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"31 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139509022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}