
Latest Publications in Computational Linguistics

Generation and Polynomial Parsing of Graph Languages with Non-Structural Reentrancies
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-07-06 · DOI: 10.1162/coli_a_00488
Johanna Björklund, F. Drewes, Anna Jonsson
Graph-based semantic representations are popular in natural language processing (NLP), where it is often convenient to model linguistic concepts as nodes and relations as edges between them. Several attempts have been made to find a generative device that is sufficiently powerful to describe languages of semantic graphs, while at the same time allowing efficient parsing. We contribute to this line of work by introducing graph extension grammar, a variant of the contextual hyperedge replacement grammars proposed by Hoffmann et al. Contextual hyperedge replacement can generate graphs with non-structural reentrancies, a type of node-sharing that is very common in formalisms such as abstract meaning representation, but which context-free types of graph grammars cannot model. To provide our formalism with a way to place reentrancies in a linguistically meaningful way, we endow rules with logical formulas in counting monadic second-order logic. We then present a parsing algorithm and show as our main result that this algorithm runs in polynomial time on graph languages generated by a subclass of our grammars, the so-called local graph extension grammars.
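A non-structural reentrancy can be made concrete with a tiny sketch. The following is purely illustrative (it is not the authors' graph extension grammar formalism): an AMR-style graph for "The boy wants to sleep", in which the node for *boy* is shared as an argument of two predicates and therefore has in-degree greater than one. All names here (`amr`, `in_degree`) are hypothetical.

```python
# Illustrative AMR-style graph for "The boy wants to sleep".
# Each node id maps to (concept label, outgoing role edges).
amr = {
    "w": ("want-01", {"ARG0": "b", "ARG1": "s"}),
    "s": ("sleep-01", {"ARG0": "b"}),  # "b" reused here: a reentrancy
    "b": ("boy", {}),
}

def in_degree(graph, node):
    """Count incoming edges; a reentrant node has in-degree > 1."""
    return sum(tgt == node
               for _, edges in graph.values()
               for tgt in edges.values())

reentrant = [n for n in amr if in_degree(amr, n) > 1]
print(reentrant)
```

Context-free graph grammars cannot in general produce this kind of node sharing, which is what motivates the contextual rules described in the abstract.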
Citations: 3
Languages through the Looking Glass of BPE Compression
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-07-06 · DOI: 10.1162/coli_a_00489
Ximena Gutierrez-Vasques, C. Bentz, T. Samardžić
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It uncovers redundant patterns for compressing the data, and hence alleviates the sparsity problem in downstream applications. Subwords discovered during the first merge operations tend to have the most substantial impact on the compression of texts. However, the structural underpinnings of this effect have not been analyzed cross-linguistically. We conduct in-depth analyses across 47 typologically diverse languages and three parallel corpora, and thereby show that the types of recurrent patterns that have the strongest impact on compression are an indicator of morphological typology. For languages with richer inflectional morphology there is a preference for highly productive subwords on the early merges, while for languages with less inflectional morphology, idiosyncratic subwords are more prominent. Both types of patterns contribute to efficient compression. Countering the common perception that BPE subwords are not linguistically relevant, we find patterns across languages that resemble those described in traditional typology. We thus propose a novel way to characterize languages according to their BPE subword properties, inspired by the notion of morphological productivity in linguistics. This allows us to have language vectors that encode typological knowledge induced from raw text. Our approach is easily applicable to a wider range of languages and texts, as it does not require annotated data or any external linguistic knowledge. We discuss its potential contributions to quantitative typology and multilingual NLP.
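The role of early merges can be illustrated with a toy BPE implementation. This is a sketch only, not the authors' experimental setup: it repeatedly merges the most frequent adjacent symbol pair, so the first merges absorb the most frequent recurring patterns and contribute the most to compression.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair, starting from single characters."""
    corpus = Counter(tuple(w) for w in words)  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["walking", "talking", "walked", "talked"], 3))
```

On an inflecting toy corpus like this, the earliest merges pick out the shared, productive stem material, which is the kind of signal the study links to morphological typology.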
Citations: 0
Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-06-13 · DOI: 10.1162/coli_a_00487
Alex Rosenfeld, L. Hinrichs
Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to find fine-grained distinctions, such as intracity differences in language use. We demonstrate that it is possible to embed smaller areas, which can provide higher resolution analyses of language variation. We embed voting precincts, which are tiny, evenly sized political divisions for the administration of elections. The issue with modeling language use in small areas is that the data becomes incredibly sparse, with many areas having scant social media data. We propose a novel embedding approach that alternates training with smoothing, which mitigates these sparsity issues. We focus on linguistic variation across Texas as it is relatively understudied. We developed two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preference of lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement this second evaluation with a methodology for using embeddings as a kind of genetic code where we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon, thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.
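The alternation of training with smoothing can be pictured schematically. Everything below is a guess at the general shape, not the authors' procedure: a hypothetical `smooth` step pulls each precinct's embedding toward the mean of its geographic neighbors, so precincts with scant data borrow signal from their surroundings; `adjacency` and `alpha` are invented for illustration.

```python
import numpy as np

def smooth(emb, neighbors, alpha=0.5):
    """One smoothing pass: move each area's embedding toward the mean
    of its neighbors' embeddings, sharing signal into sparse areas."""
    out = emb.copy()
    for i, ns in neighbors.items():
        if ns:
            out[i] = (1 - alpha) * emb[i] + alpha * emb[ns].mean(axis=0)
    return out

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))           # 4 precincts, 8-dim embeddings
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
for _ in range(3):
    # a real pipeline would run an embedding-training step on each
    # precinct's social-media text here, then apply smoothing
    emb = smooth(emb, adjacency)
```

The design point is that smoothing alone would wash out local signal, while training alone leaves sparse precincts noisy; alternating the two trades off between them.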
Citations: 0
Cross-Lingual Transfer with Language-Specific Subnetworks for Low-Resource Dependency Parsing
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-05-25 · DOI: 10.1162/coli_a_00482
Rochelle Choenni, Dan Garrette, Ekaterina Shutova
Large multilingual language models typically share their parameters across all languages, which enables cross-lingual task transfer, but learning can also be hindered when training updates from different languages are in conflict. In this article, we propose novel methods for using language-specific subnetworks, which control cross-lingual parameter sharing, to reduce conflicts and increase positive transfer during fine-tuning. We introduce dynamic subnetworks, which are jointly updated with the model, and we combine our methods with meta-learning, an established, but complementary, technique for improving cross-lingual transfer. Finally, we provide extensive analyses of how each of our methods affects the models.
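The core idea of language-specific subnetworks can be sketched with binary masks over shared parameters. The sketch below is illustrative only; the mask construction, sizes, and update rule are all assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
shared = rng.normal(size=16)                       # shared parameters
# one binary mask per language, selecting that language's subnetwork
masks = {lang: rng.random(16) < 0.5 for lang in ("en", "fi", "ta")}

def masked_update(params, mask, grad, lr=0.1):
    """Gradient step restricted to one language's subnetwork."""
    return params - lr * grad * mask

grad = np.ones(16)
updated = masked_update(shared, masks["fi"], grad)
# parameters outside the "fi" subnetwork are untouched, which is how
# masking limits interference between languages during fine-tuning
assert (updated[~masks["fi"]] == shared[~masks["fi"]]).all()
```

Conflicting updates from two languages can then only collide where their masks overlap, which is the lever the paper uses to reduce negative interference while keeping positive transfer.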
Citations: 2
Statistical Methods for Annotation Analysis by Silviu Paun, Ron Artstein, and Massimo Poesio
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-05-25 · DOI: 10.1162/coli_r_00483
Rodrigo Wilkens
Citations: 0
Machine Learning for Ancient Languages: A Survey
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-05-25 · DOI: 10.1162/coli_a_00481
Thea Sommerschield, Yannis Assael, John Pavlopoulos, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, J. Prag, I. Androutsopoulos, Nando de Freitas
Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in Artificial Intelligence and Machine Learning have enabled analyses on a scale and in a detail that are reshaping the field of Humanities, similarly to how microscopes and telescopes have contributed to the realm of Science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script and medium, spanning over three and a half millennia of civilisations around the ancient world. To analyse the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitisation, restoration, attribution, linguistic analysis, textual criticism, translation and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the Humanities and Machine Learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, flagging promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the Humanities and Machine Learning.
Citations: 7
Dimensions of Explanatory Value in NLP Models
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-05-04 · DOI: 10.1162/coli_a_00480
Kees van Deemter
Performance on a dataset is often regarded as the key criterion for assessing NLP models. I will argue for a broader perspective, which emphasizes scientific explanation. I will draw on a long tradition in the philosophy of science, and on the Bayesian approach to assessing scientific theories, to argue for a plurality of criteria for assessing NLP models. To illustrate these ideas, I will compare some recent models of language production with each other. I conclude by asking what it would mean for institutional policies if the NLP community took these ideas onboard.
Citations: 1
Comparing Selective Masking Methods for Depression Detection in Social Media
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-04-28 · DOI: 10.1162/coli_a_00479
Chanapa Pananookooln, Jakrapop Akaranee, Chaklam Silpasuwanchai
Identifying those at risk for depression is a crucial issue where social media provides an excellent platform for examining the linguistic patterns of depressed individuals. A significant challenge in the depression classification problem is ensuring that prediction models are not overly dependent on topic keywords (i.e., depression keywords), such that they fail to predict when such keywords are unavailable. One promising approach is masking, i.e., by selectively masking various words and asking the model to predict the masked words, the model is forced to learn the inherent language patterns of depression. This study evaluates seven masking techniques. Moreover, whether to predict the masked words during the pre-training or the fine-tuning phase was also examined. Lastly, six class imbalance ratios were compared to determine the robustness of masked word selection methods. Key findings demonstrated that selective masking outperforms random masking in terms of F1-score. The most accurate and robust models were identified. Our research also indicated that reconstructing the masked words during the pre-training phase is more advantageous than during the fine-tuning phase. Further discussion and implications are provided. This is the first study to comprehensively compare masked word selection methods, which has broad implications for the field of depression classification and general NLP. Our code can be found at: https://github.com/chanapapan/Depression-Detection.
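One way to picture selective masking: always mask lexicon keywords so the model cannot lean on them, while masking other tokens at a low random rate. This is a hedged sketch of just one plausible variant (the study compares seven techniques); the keyword list and rates here are invented.

```python
import random

DEPRESSION_KEYWORDS = {"depressed", "sad", "hopeless"}  # toy lexicon

def selective_mask(tokens, keywords=DEPRESSION_KEYWORDS, p_other=0.15):
    """Mask every keyword, plus other tokens at a low random rate, so
    the model must learn the surrounding language patterns instead of
    relying on topic keywords."""
    return ["[MASK]" if t in keywords or random.random() < p_other else t
            for t in tokens]

random.seed(0)
print(selective_mask("i feel so hopeless and tired every day".split()))
```

A model pre-trained to reconstruct these masks can no longer score well by memorizing the keywords themselves, which is the failure mode the abstract describes.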
Citations: 0
Reflection of Demographic Background on Word Usage
IF 9.3 · Tier 2 (Computer Science) · Q1 (Arts and Humanities) · Pub Date: 2023-01-17 · DOI: 10.1162/coli_a_00475
Aparna Garimella, Carmen Banea, Rada Mihalcea
The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches of encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the baseline for all demographic categories, even for scenarios with classification into 15 categories, illustrating the usefulness of topic-based features in identifying word usage differences. Further, we note that for location and industry, topics extracted from immediate context are the best predictors of word usages, hinting at the importance of word meaning and its grammatical function for these demographics, while for gender, topics obtained from longer contexts are better predictors for word usage.
Cited: 0
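The cross-demographic word models described in the abstract can be illustrated with a minimal sketch: for a single target word, train a classifier to predict a writer's demographic from the word's usage contexts. The sketch below uses a multinomial naive Bayes over bag-of-context-words features as a crude stand-in for the paper's topic-based encodings; the `NaiveBayes` class, the toy sentences, and the `tech`/`home` labels are all invented for illustration and are not the authors' setup.

```python
from collections import Counter
import math

def context_features(sentences, target, window=3):
    """One bag-of-context-words Counter per occurrence of `target`
    (a crude stand-in for the paper's topic-based usage encodings)."""
    feats = []
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok == target:
                ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
                feats.append(Counter(ctx))
    return feats

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""
    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.prior = {c: math.log(y.count(c) / len(y)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for feats, label in zip(X, y):
            self.counts[label].update(feats)
            self.vocab |= set(feats)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, feats):
        v = len(self.vocab)
        def score(c):
            s = self.prior[c]
            for w, n in feats.items():
                s += n * math.log((self.counts[c][w] + 1) / (self.totals[c] + v))
            return s
        return max(self.classes, key=score)

# Invented example: does "table" behave like a tech word or a furniture word?
tech = ["we joined the table on user id", "query the table for rows"]
home = ["put the vase on the table", "a wooden table and chairs"]
X = context_features(tech, "table") + context_features(home, "table")
y = ["tech"] * 2 + ["home"] * 2
clf = NaiveBayes().fit(X, y)
pred = clf.predict(Counter("select rows from the".split()))
print(pred)  # the query contexts look like the "tech" usage
```

With richer features — for example, topics inferred from short versus long contexts, as the abstract contrasts — the same per-word classification setup can probe which aspects of usage differ across demographics.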
Gradual Modifications and Abrupt Replacements: Two Stochastic Lexical Ingredients of Language Evolution
IF 9.3 Tier 2 (Computer Science) Q1 Arts and Humanities Pub Date : 2023-01-13 DOI: 10.1162/coli_a_00471
M. Pasquini, M. Serva, D. Vergni
The evolution of the vocabulary of a language is characterized by two different random processes: abrupt lexical replacements, in which an entirely new word emerges to represent a given concept (the phenomenon underlying Swadesh's founding of glottochronology in the 1950s), and gradual lexical modifications that progressively alter words over the centuries, considered here in detail for the first time. The main discriminant between these two processes is their impact on cognacy within a family of languages or dialects: the former modifies the subsets of cognate terms, while the latter does not. Automated cognate detection, performed here with a new approach inspired by graph theory, is a key preliminary step that later allows us to measure the effects of the slow modification process. We test our dual approach on the family of Malagasy dialects using a cladistic analysis, which provides strong evidence that lexical replacements and gradual lexical modifications are two random processes that separately drive the evolution of languages.
{"title":"Gradual Modifications and Abrupt Replacements: Two Stochastic Lexical Ingredients of Language Evolution","authors":"M. Pasquini, M. Serva, D. Vergni","doi":"10.1162/coli_a_00471","DOIUrl":"https://doi.org/10.1162/coli_a_00471","url":null,"abstract":"The evolution of the vocabulary of a language is characterized by two different random processes: abrupt lexical replacements, when a complete new word emerges to represent a given concept (which was at the basis of the Swadesh foundation of glottochronology in the 1950s), and gradual lexical modifications that progressively alter words over the centuries, considered here in detail for the first time. The main discriminant between these two processes is their impact on cognacy within a family of languages or dialects, since the former modifies the subsets of cognate terms and the latter does not. The automated cognate detection, which is here performed following a new approach inspired by graph theory, is a key preliminary step that allows us to later measure the effects of the slow modification process. We test our dual approach on the family of Malagasy dialects using a cladistic analysis, which provides strong evidence that lexical replacements and gradual lexical modifications are two random processes that separately drive the evolution of languages.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":null,"pages":null},"PeriodicalIF":9.3,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48927342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited: 1
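The automated cognate detection step can be sketched in the graph-inspired spirit the abstract describes: treat word forms as nodes, link pairs whose length-normalized edit distance falls below a threshold, and read putative cognate sets off the connected components. This is a simplified illustration under invented assumptions — the `0.5` threshold and the Malagasy-like forms are made up — not the authors' actual method.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cognate_sets(forms, threshold=0.5):
    """Link two forms when their length-normalized edit distance is below
    `threshold`; putative cognate sets are the connected components of the
    resulting similarity graph (found here with a union-find over indices)."""
    parent = list(range(len(forms)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i in range(len(forms)):
        for j in range(i + 1, len(forms)):
            d = edit_distance(forms[i], forms[j]) / max(len(forms[i]), len(forms[j]))
            if d < threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i, w in enumerate(forms):
        groups.setdefault(find(i), []).append(w)
    return sorted(groups.values(), key=len, reverse=True)

# Invented dialect forms for one concept, plus one unrelated form
forms = ["rano", "ranu", "rano", "drano", "vary"]
sets_ = cognate_sets(forms)
print(sets_)  # the four rano-like variants cluster; "vary" stands alone
```

Connected components tolerate chains of gradually modified forms (rano–ranu–drano) that a single pairwise threshold might miss, which matches the slow-modification regime the paper aims to measure after cognate sets are fixed.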
Journal: Computational Linguistics
Copyright © 2023 Book学术 All rights reserved.