Time flies and it has been close to five and a half years since I became the editor-in-chief of Computational Linguistics on 15 July 2018. In this editorial, I will describe the changes that I have introduced at the journal, and highlight the achievements and challenges of the journal.
{"title":"My Tenure as the Editor-in-Chief of Computational Linguistics","authors":"Hwee Tou Ng","doi":"10.1162/coli_e_00505","DOIUrl":"https://doi.org/10.1162/coli_e_00505","url":null,"abstract":"Time flies and it has been close to five and a half years since I became the editor-in-chief of Computational Linguistics on 15 July 2018. In this editorial, I will describe the changes that I have introduced at the journal, and highlight the achievements and challenges of the journal.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"83 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139422102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anton Thielmann, Arik Reuter, Quentin Seifert, Elisabeth Bergherr, Benjamin Säfken
Extracting and identifying latent topics in large text corpora has gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach of topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. Through simple corpus expansion, our model can detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task. We demonstrate the competitive performance of our method with a large benchmark study, and achieve superior results compared to state-of-the-art topic modeling and document clustering models.
{"title":"Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion","authors":"Anton Thielmann, Arik Reuter, Quentin Seifert, Elisabeth Bergherr, Benjamin Säfken","doi":"10.1162/coli_a_00506","DOIUrl":"https://doi.org/10.1162/coli_a_00506","url":null,"abstract":"Extracting and identifying latent topics in large text corpora has gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach of topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. Through simple corpus expansion, our model can detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task. We demonstrate the competitive performance of our method with a large benchmark study, and achieve superior results compared to state-of-the-art topic modeling and document clustering models.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"14 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139409651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this paper, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failures to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, this would have worrying implications for the rigour of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.
{"title":"Common Flaws in Running Human Evaluation Experiments in NLP","authors":"Craig Thomson, Ehud Reiter, Anya Belz","doi":"10.1162/coli_a_00508","DOIUrl":"https://doi.org/10.1162/coli_a_00508","url":null,"abstract":"While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this paper, we describe the types of flaws we discovered which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, it would have worrying implications for the rigour of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"45 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139415273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple measures, such as WEAT or MAC, attempt to quantify the magnitude of bias present in word embeddings in terms of a single-number metric. However, such metrics and the related statistical significance calculations rely on treating pre-averaged data as individual data points and employing bootstrapping techniques with low sample sizes. We show that similar results can be easily obtained using such methods even if the data are generated by a null model lacking the intended bias. Consequently, we argue that this approach generates false confidence. To address this issue, we propose a Bayesian alternative: hierarchical Bayesian modeling, which enables a more uncertainty-sensitive inspection of bias in word embeddings at different levels of granularity. To showcase our method, we apply it to Religion, Gender, and Race word lists from the original research, together with our control neutral word lists. We deploy the method using Google, Glove, and Reddit embeddings. Further, we utilize our approach to evaluate a debiasing technique applied to the Reddit word embedding. Our findings reveal a more complex landscape than suggested by the proponents of single-number metrics. The datasets and source code for the paper are publicly available.
{"title":"A Bayesian approach to uncertainty in word embedding bias estimation","authors":"Alicja Dobrzeniecka, Rafal Urbaniak","doi":"10.1162/coli_a_00507","DOIUrl":"https://doi.org/10.1162/coli_a_00507","url":null,"abstract":"Multiple measures, such as WEAT or MAC, attempt to quantify the magnitude of bias present in word embeddings in terms of a single-number metric. However, such metrics and the related statistical significance calculations rely on treating pre-averaged data as individual data points and employing bootstrapping techniques with low sample sizes. We show that similar results can be easily obtained using such methods even if the data are generated by a null model lacking the intended bias. Consequently, we argue that this approach generates false confidence. To address this issue, we propose a Bayesian alternative: hierarchical Bayesian modeling, which enables a more uncertainty-sensitive inspection of bias in word embeddings at different levels of granularity. To showcase our method, we apply it to Religion, Gender, and Race word lists from the original research, together with our control neutral word lists. We deploy the method using Google, Glove, and Reddit embeddings. Further, we utilize our approach to evaluate a debiasing technique applied to the Reddit word embedding. Our findings reveal a more complex landscape than suggested by the proponents of single-number metrics. The datasets and source code for the paper are publicly available.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"14 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139409608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic representations capture the meaning of a text. Abstract Meaning Representation (AMR), a type of semantic representation, focuses on predicate-argument structure and abstracts away from surface form. Though AMR was developed initially for English, it has now been adapted to a multitude of languages in the form of non-English annotation schemas, cross-lingual text-to-AMR parsing, and AMR-to-(non-English) text generation. We advance prior work on cross-lingual AMR by thoroughly investigating the amount, types, and causes of differences that appear in AMRs of different languages. Further, we compare how AMR captures meaning in cross-lingual pairs versus strings, and show that AMR graphs are able to draw out fine-grained differences between parallel sentences. We explore three primary research questions: (1) What are the types and causes of differences in parallel AMRs? (2) How can we measure the amount of difference between AMR pairs in different languages? (3) Given that AMR structure is affected by language and exhibits cross-lingual differences, how do cross-lingual AMR pairs compare to string-based representations of cross-lingual sentence pairs? We find that the source language itself does have a measurable impact on AMR structure, and that translation divergences and annotator choices also lead to differences in cross-lingual AMR pairs. We explore the implications of this finding throughout our study, concluding that, while AMR is useful for capturing meaning across languages, evaluations need to take source language influences into account if they are to paint an accurate picture of system output, and of meaning more generally.
{"title":"Assessing the Cross-linguistic Utility of Abstract Meaning Representation","authors":"Shira Wein, Nathan Schneider","doi":"10.1162/coli_a_00503","DOIUrl":"https://doi.org/10.1162/coli_a_00503","url":null,"abstract":"Semantic representations capture the meaning of a text. Meaning Representation (AMR), a type of semantic representation, focuses on predicate-argument structure and abstracts away from surface form. Though AMR was developed initially for English, it has now been adapted to a multitude of languages in the form of non-English annotation schemas, cross-lingual text-to-AMR parsing, and AMR-to-(non-English) text generation. We advance prior work on cross-lingual AMR by thoroughly investigating the amount, types, and causes of differences which appear in AMRs of different languages. Further, we compare how AMR captures meaning in cross-lingual pairs versus strings, and show that AMR graphs are able to draw out fine-grained differences between parallel sentences. We explore three primary research questions: (1) What are the types and causes of differences in parallel AMRs? (2) How can we measure the amount of difference between AMR pairs in different languages? (3) Given that AMR structure is affected by language and exhibits cross-lingual differences, how do cross-lingual AMR pairs compare to string-based representations of cross-lingual sentence pairs? We find that the source language itself does have a measurable impact on AMR structure, and that translation divergences and annotator choices also lead to differences in cross-lingual AMR pairs. We explore the implications of this finding throughout our study, concluding that, while AMR is useful to capture meaning across languages, evaluations need to take into account source language influences if they are to paint an accurate picture of system output, and meaning generally.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"3 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138819238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Divergence of languages observed at the surface level is a major challenge encountered by multilingual data representation, especially when typologically distant languages are involved. Drawing inspiration from a formalist Chomskyan perspective on language universals, Universal Grammar (UG), this article employs deductively pre-defined universals to analyse a multilingually heterogeneous phenomenon, event nominals. In this way, the deeper universality of event nominals beneath their huge divergence across languages is uncovered, which enables us to break barriers between languages and thus extend insights from some synthetic languages to a non-inflectional language, Mandarin Chinese. Our empirical investigation also demonstrates that this UG-inspired schema is effective: with its assistance, the inter-annotator agreement (IAA) for identifying event nominals in Mandarin grows from 88.02% to 94.99%, and automatic detection of event-reading nominalizations on the newly established data achieves an accuracy of 94.76% and an F1 score of 91.3%, which significantly surpass those achieved on the pre-existing resource by 9.8% and 5.2%, respectively. Our systematic analysis also sheds light on nominal semantic role labelling (SRL). By providing a clear definition and classification of the arguments of event nominals, the IAA of this task significantly increases from 90.46% to 98.04%.
{"title":"UG-schematic Annotation for Event Nominals: A Case Study in Mandarin Chinese","authors":"Wenxi Li, Guy Emerson, Yutong Zhang, Weiwei Sun","doi":"10.1162/coli_a_00504","DOIUrl":"https://doi.org/10.1162/coli_a_00504","url":null,"abstract":"Divergence of languages observed at the surface level is a major challenge encountered by multilingual data representation, especially when typologically distant languages are involved. Drawing inspirations from a formalist Chomskyan perspective towards language universals, Universal Grammar (UG), this article employs deductively pre-defined universals to analyse a multilingually heterogeneous phenomenon, event nominals. In this way, deeper universality of event nominals beneath their huge divergence in different languages is uncovered, which empowers us to break barriers between languages and thus extend insights from some synthetic languages to a non-inflectional language, Mandarin Chinese. Our empirical investigation also demonstrates this UG-inspired schema is effective: with its assistance, the inter-annotator agreement (IAA) for identifying event nominals in Mandarin grows from 88.02% to 94.99%, and automatic detection of event-reading nominalizations on the newly-established data achieves an accuracy of 94.76% and an F1 score of 91.3%, which are significantly surpass those achieved on the pre-existing resource by 9.8% and 5.2% respectively. Our systematic analysis also sheds light on nominal semantic role labelling (SRL). By providing a clear definition and classification on arguments of event nominal, the IAA of this task significantly increases from 90.46% to 98.04%.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"3 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138819390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Caleb Ziems, Omar Shaikh, Zhehao Zhang, William Held, Jiaao Chen, Diyi Yang
Large Language Models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the Computational Social Science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers' gold references. We conclude that the performance of today's LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are poised to meaningfully participate in social science analysis in partnership with humans.
{"title":"Can Large Language Models Transform Computational Social Science?","authors":"Caleb Ziems, Omar Shaikh, Zhehao Zhang, William Held, Jiaao Chen, Diyi Yang","doi":"10.1162/coli_a_00502","DOIUrl":"https://doi.org/10.1162/coli_a_00502","url":null,"abstract":"Large Language Models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the Computational Social Science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers' gold references.We conclude that the performance of today’s LLMs can augment the CSS research pipeline in two ways: (1) serving as zeroshot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are posed to meaningfully participate in social science analysis in partnership with humans.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"103 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138579433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rudra Ranajee Saha, Raymond T. Ng, Laks V. S. Lakshmanan
Identification of stance has recently gained a lot of attention with the extreme growth of fake news and filter bubbles. Over the last decade, many feature-based and deep-learning approaches have been proposed to solve Stance Detection. However, almost none of the existing works focus on providing a meaningful explanation for their predictions. In this work, we study Stance Detection with an emphasis on generating explanations for the predicted stance by capturing the pivotal argumentative structure embedded in a document. We propose to build a Stance Tree, which uses Rhetorical Parsing to construct an evidence tree and Dempster-Shafer theory to aggregate the evidence. Human studies show that our unsupervised technique for generating stance explanations outperforms the SOTA extractive summarization method in terms of informativeness, non-redundancy, coverage, and overall quality. Furthermore, experiments show that our explanation-based stance prediction matches or exceeds the performance of the SOTA model on various benchmark datasets.
{"title":"Stance Detection with Explanations","authors":"Rudra Ranajee Saha, Raymond T. Ng, Laks V. S. Lakshmanan","doi":"10.1162/coli_a_00501","DOIUrl":"https://doi.org/10.1162/coli_a_00501","url":null,"abstract":"Identification of stance has recently gained a lot of attention with the extreme growth of fake news and filter bubbles. Over the last decade, many feature-based and deep-learning approaches have been proposed to solve Stance Detection. However, almost none of the existing works focus on providing a meaningful explanation for their prediction. In this work, we study Stance Detection with an emphasis on generating explanations for the predicted stance by capturing the pivotal argumentative structure embedded in a document. We propose to build a Stance Tree which utilizes Rhetorical Parsing to construct an evidence tree and to use Dempster Shafer Theory to aggregate the evidence. Human studies show that our unsupervised technique of generating stance explanations outperforms the SOTA extractive summarization method in terms of informativeness, non-redundancy, coverage, and overall quality. Furthermore, experiments show that our explanation-based stance prediction excels or matches the performance of the SOTA model on various benchmark datasets.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"18 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138579437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Polysemy is the type of lexical ambiguity where a word has multiple distinct but related interpretations. In the past decade, it has been the subject of a great many studies across multiple disciplines including linguistics, psychology, neuroscience, and computational linguistics, which have made it increasingly clear that the complexity of polysemy precludes simple, universal answers, especially concerning the representation and processing of polysemous words. But fuelled by the growing availability of large, crowdsourced datasets providing substantial empirical evidence; improved behavioural methodology; and the development of contextualised language models capable of encoding the fine-grained meaning of a word within a given context, the literature on polysemy has recently developed more complex theoretical analyses. In this survey we discuss these recent contributions to the investigation of polysemy against the backdrop of a long legacy of research across multiple decades and disciplines. Our aim is to bring together different perspectives to achieve a more complete picture of the heterogeneity and complexity of the phenomenon of polysemy. Specifically, we highlight evidence supporting a range of hybrid models of the mental processing of polysemes. These hybrid models combine elements from different previous theoretical approaches to explain patterns and idiosyncrasies in the processing of polysemous words that the best-known models so far have failed to account for. Our literature review finds that i) traditional analyses of polysemy can be limited in their generalisability by loose definitions and selective materials; ii) linguistic tests provide useful evidence on individual cases, but fail to capture the full range of factors involved in the processing of polysemous sense extensions; and iii) recent behavioural (psycho)linguistic studies, large-scale annotation efforts, and investigations leveraging contextualised language models provide accumulating evidence suggesting that polysemous sense similarity covers a wide spectrum between identity of sense and homonymy-like unrelatedness of meaning. We hope that the interdisciplinary account of polysemy provided in this survey inspires further fundamental research on the nature of polysemy and better equips applied research to deal with the complexity surrounding the phenomenon, e.g., by enabling the development of benchmarks and testing paradigms for large language models informed by a greater portion of the rich evidence on the phenomenon currently available.
{"title":"Polysemy - Evidence from Linguistics, Behavioural Science and Contextualised Language Models","authors":"Janosch Haber, Massimo Poesio","doi":"10.1162/coli_a_00500","DOIUrl":"https://doi.org/10.1162/coli_a_00500","url":null,"abstract":"Polysemy is the type of lexical ambiguity where a word has multiple distinct but related interpretations. In the past decade, it has been the subject of a great many studies across multiple disciplines including linguistics, psychology, neuroscience, and computational linguistics, which have made it increasingly clear that the complexity of polysemy precludes simple, universal answers, especially concerning the representation and processing of polysemous words. But fuelled by the growing availability of large, crowdsourced datasets providing substantial empirical evidence; improved behavioral methodology; and the development of contextualised language models capable of encoding the fine-grained meaning of a word within a given context, the literature on polysemy recently has developed more complex theoretical analyses. In this survey we discuss these recent contributions to the investigation of polysemy against the backdrop of a long legacy of research across multiple decades and disciplines. Our aim is to bring together different perspectives to achieve a more complete picture of the heterogeneity and complexity of the phenomenon of polysemy. Specifically, we highlight evidence supporting a range of hybrid models of the mental processing of polysemes. These hybrid models combine elements from different previous theoretical approaches to explain patterns and idiosyncrasies in the processing of polysemous that the best known models so far have failed to account for. Our literature review finds that i) traditional analyses of polysemy can be limited in their generalisability by loose definitions and selective materials; ii) linguistic tests provide useful evidence on individual cases, but fail to capture the full range of factors involved in the processing of polysemous sense extensions; and iii) recent behavioural (psycho) linguistics studies, largescale annotation efforts and investigations leveraging contextualised language models provide accumulating evidence suggesting that polysemous sense similarity covers a wide spectrum between identity of sense and homonymy-like unrelatedness of meaning. We hope that the interdisciplinary account of polysemy provided in this survey inspires further fundamental research on the nature of polysemy and better equips applied research to deal with the complexity surrounding the phenomenon, e.g. by enabling the development of benchmarks and testing paradigms for large language models informed by a greater portion of the rich evidence on the phenomenon currently available.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"8 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138579445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}