Pub Date: 2023-05-01 | Epub Date: 2021-10-29 | DOI: 10.1017/s1351324921000292
Charles Chen, Razvan Bunescu, Cindy Marling
We propose a new setting for question answering in which users can query the system using both natural language and direct interactions within a graphical user interface that displays multiple time series associated with an entity of interest. The user interacts with the interface in order to understand the entity's state and behavior, entailing sequences of actions and questions whose answers may depend on previous factual or navigational interactions. We describe a pipeline implementation where spoken questions are first transcribed into text which is then semantically parsed into logical forms that can be used to automatically extract the answer from the underlying database. The speech recognition module is implemented by adapting a pre-trained LSTM-based architecture to the user's speech, whereas for the semantic parsing component we introduce an LSTM-based encoder-decoder architecture that models context dependency through copying mechanisms and multiple levels of attention over inputs and previous outputs. When evaluated separately, with and without data augmentation, both models are shown to substantially outperform several strong baselines. Furthermore, the full pipeline evaluation shows only a small degradation in semantic parsing accuracy, demonstrating that the semantic parser is robust to mistakes in the speech recognition output. The new question answering paradigm proposed in this paper has the potential to improve the presentation and navigation of the large amounts of sensor data and life events that are generated in many areas of medicine.
"A Semantic Parsing Pipeline for Context-Dependent Question Answering over Temporally Structured Data." Natural Language Engineering 29(2): 769-793. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10348695/pdf/
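The pipeline the abstract describes (spoken question, transcription, logical form, database lookup) can be sketched in a toy rule-based form. Everything below is illustrative: the single regex "parser", the logical-form schema, and the dict-backed database are stand-ins for the paper's LSTM-based modules, not the authors' implementation.

```python
import re

# Toy "database": one time series of blood glucose readings keyed by timestamp.
DATABASE = {
    "glucose": {"08:00": 110, "12:00": 145, "18:00": 130},
}

def transcribe(audio):
    # Stand-in for the speech recognition module: here the "audio" is already text.
    return audio.lower()

def semantic_parse(question):
    # Stand-in for the encoder-decoder semantic parser: a single regex rule
    # mapping a question pattern to a logical form (aggregation op + series name).
    m = re.match(r"what was the (max|min) (\w+)", question)
    if m:
        return {"op": m.group(1), "series": m.group(2)}
    return None

def execute(logical_form):
    # Evaluate the logical form against the underlying database.
    values = DATABASE[logical_form["series"]].values()
    return max(values) if logical_form["op"] == "max" else min(values)

answer = execute(semantic_parse(transcribe("What was the max glucose today?")))
```

The point of the decomposition is that each stage can be evaluated separately, which is how the paper measures robustness of the parser to transcription errors.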
Pub Date: 2023-05-01 | DOI: 10.1017/s1351324923000141
Kenneth Ward Church, Raman Chandrasekar
Abstract Our last emerging trends article introduced Risks 1.0 (fairness and bias) and Risks 2.0 (addictive, dangerous, deadly, and insanely profitable). This article introduces Risks 3.0 (spyware and cyber weapons). Risks 3.0 are less profitable, but more destructive. We summarize two recent books, Pegasus: How a Spy in Your Pocket Threatens the End of Privacy, Dignity, and Democracy and This Is How They Tell Me the World Ends: The Cyberweapons Arms Race. The first book starts with a leak of 50,000 phone numbers targeted by spyware named Pegasus. Pegasus uses a zero-click exploit to obtain root access to your phone, taking control of the microphone, camera, GPS, text messages, etc. The list of 50,000 numbers includes journalists, politicians, and academics, as well as their friends and family. Some of these people have been murdered. The second book describes the history of cyber weapons such as Stuxnet, which is described as crossing the Rubicon. In the short term, it set back Iran's nuclear program for less than the cost of conventional weapons, but it did not take long for Iran to build the fourth-biggest cyber army in the world. As spyware continues to proliferate, we envision a future dystopia where everyone spies on everyone. Nothing will be safe from hacking: not your identity, or your secrets, or your passwords, or your bank accounts. When the endpoints (phones) have been compromised, technologies such as end-to-end encryption and multi-factor authentication offer a false sense of security; encryption and authentication are as pointless as closing the proverbial barn door after the fact. To address Risks 3.0, journalists are using the tools of their trade to raise awareness in the court of public opinion. We should do what we can to support them. This paper is a small step in that direction.
"Emerging trends: Risks 3.0 and proliferation of spyware to 50,000 cell phones." Natural Language Engineering 29(1): 824-841.
Pub Date: 2023-04-28 | DOI: 10.1017/S1351324923000116
M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina
"Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task - CORRIGENDUM." Natural Language Engineering 29(1): 1198.
Pub Date: 2023-04-11 | DOI: 10.1017/s1351324923000104
Soumitra Ghosh, Amit Priyankar, Asif Ekbal, P. Bhattacharyya
Moderators often face a double challenge in reducing offensive and harmful content on social media. Such content must be kept from circulating freely, yet strict censorship cannot be implemented, owing to a tricky dilemma: free speech on the Internet must be preserved while harmful posts are limited, without overreacting. Existing systems do not exploit the correlation between hate-offensive content and aggressive posts; instead, they address the tasks individually. Cost-effective, sophisticated multi-task systems that can effectively detect aggressive and offensive content on social media are therefore highly desirable. This work presents a novel multifaceted transformer-based framework to identify aggressive and hate posts on social media. Through an end-to-end transformer-based multi-task network, our proposed approach addresses the following tasks: (a) aggression identification, (b) misogynistic aggression identification, (c) identifying hate-offensive and non-hate-offensive content, (d) identifying hate, profane, and offensive posts, and (e) identifying the type of offense. We further investigate the role of emotion in improving the system's overall performance by learning the task of emotion detection jointly with the other tasks. We evaluate our approach on two popular benchmark datasets of aggression and hate speech, covering four languages, and compare the system performance with various state-of-the-art methods. Results indicate that our multi-task system performs well on all the tasks across multiple languages, outperforming several benchmark methods. Moreover, the secondary task of emotion detection substantially improves the system performance for all the tasks, indicating a strong correlation among the tasks of aggression, hate, and emotion, thus opening avenues for future research.
"A transformer-based multi-task framework for joint detection of aggression and hate on social media data." Natural Language Engineering.
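The joint setup the abstract describes — one shared encoder feeding several task-specific classification heads whose losses are combined — can be sketched with NumPy. The dimensions, class counts, and unweighted loss sum below are illustrative assumptions, not the paper's configuration, and the random "encoder output" stands in for a transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared "encoder" output for a batch of 4 posts (stand-in for a transformer).
hidden = rng.normal(size=(4, 16))

# One linear head per task: aggression (3 classes), hate (2), emotion (6).
tasks = {"aggression": 3, "hate": 2, "emotion": 6}
heads = {t: rng.normal(size=(16, k)) for t, k in tasks.items()}
labels = {"aggression": np.array([0, 2, 1, 0]),
          "hate": np.array([1, 0, 0, 1]),
          "emotion": np.array([3, 5, 0, 2])}

# Joint objective: sum of per-task cross-entropy losses over the shared encoding,
# so gradients from every task (including the auxiliary emotion task) update it.
total_loss = 0.0
for t, W in heads.items():
    probs = softmax(hidden @ W)
    total_loss += -np.log(probs[np.arange(4), labels[t]]).mean()
```

Training the emotion head jointly is what lets the auxiliary task shape the shared representation used by the aggression and hate heads.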
Abstract With the aid of recently proposed word embedding algorithms, the study of semantic relatedness has progressed rapidly. However, word-level representations remain insufficient for many natural language processing tasks, and various sense-level embedding learning algorithms have been proposed to address this issue. In this paper, we present a generalized model derived from existing sense retrofitting models. In this generalization, we take into account semantic relations between the senses, relation strength, and semantic strength. Experimental results show that the generalized model outperforms previous approaches on four tasks: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. Based on the generalized sense retrofitting model, we also propose a standardization process on the dimensions with four settings, a neighbor expansion process from the nearest neighbors, and combinations of these two approaches. Finally, we propose a Procrustes analysis approach, inspired by bilingual mapping models, for learning representations of senses outside the ontology. The experimental results show the advantages of these approaches on semantic relatedness tasks.
"On generalization of the sense retrofitting model," by Yang-Yin Lee, Ting-Yu Yen, Hen-Hsen Huang, Yow-Ting Shiue, Hsin-Hsi Chen. Natural Language Engineering 29(1): 1097-1125. Published 2023-03-31. DOI: 10.1017/S1351324922000523.
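The basic update that sense retrofitting models generalize — pulling each sense vector toward its original embedding and toward its ontology neighbors, with strengths on each term — can be sketched as follows. The tiny graph, the alpha/beta strengths, and the iteration count are illustrative assumptions, not this paper's generalized model.

```python
import numpy as np

# Original embeddings for three senses; "bank_river" and "shore" are
# ontology neighbors, "bank_money" is isolated.
original = {"bank_river": np.array([1.0, 0.0]),
            "shore":      np.array([0.0, 1.0]),
            "bank_money": np.array([1.0, 1.0])}
neighbors = {"bank_river": ["shore"], "shore": ["bank_river"], "bank_money": []}
alpha, beta = 1.0, 1.0  # strength of the original vector vs. each relation edge

vectors = {w: v.copy() for w, v in original.items()}
for _ in range(10):  # Jacobi-style iterations toward the fixed point
    new = {}
    for w, nbrs in neighbors.items():
        if nbrs:
            num = alpha * original[w] + beta * sum(vectors[n] for n in nbrs)
            new[w] = num / (alpha + beta * len(nbrs))
        else:
            new[w] = original[w]  # no neighbors: stay at the original embedding
    vectors = new

# Neighboring senses end up closer to each other than they were originally.
def dist(a, b):
    return float(np.linalg.norm(a - b))

closer = dist(vectors["bank_river"], vectors["shore"]) < dist(original["bank_river"], original["shore"])
```

The generalized model in the paper varies exactly these knobs: which relations connect senses, how strong each relation is, and how strongly the original embedding anchors the sense.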
Pub Date: 2023-03-29 | DOI: 10.1017/s1351324923000098
Nina Seemann, Yeong Su Lee, Julian Höllig, Michaela Geierhos
With the increase in user-generated content on social media, the detection of abusive language has become crucial and is therefore reflected in several shared tasks performed in recent years. The development of automatic detection systems is desirable, and the classification of abusive social media content can be addressed with the help of machine learning. The basis for successful development of machine learning models is the availability of consistently labeled training data, but the diversity of terms and definitions of abusive language is a crucial barrier. In this work, we analyze a total of nine datasets—five English and four German—designed for detecting abusive online content. We provide a detailed description of the datasets: for which tasks each dataset was created, how the data were collected, and which annotation guidelines were used. Our analysis shows that there is no standard definition of abusive language, which often leads to inconsistent annotations. As a consequence, it is difficult to draw cross-domain conclusions, share datasets, or use models for other abusive social media language tasks. Furthermore, our manual inspection of a random sample of each dataset revealed controversial examples. We highlight challenges in data annotation by discussing those examples, and present common problems in the annotation process, such as contradictory annotations and missing context information. Finally, to complement our theoretical work, we conduct generalization experiments on three German datasets.
"The problem of varying annotations to identify abusive language in social media content." Natural Language Engineering.
Pub Date: 2023-03-29 | DOI: 10.1017/s1351324923000074
Fengyang Shi, Guohua Feng
Book review: Syntactic n-grams in Computational Linguistics, by Grigori Sidorov. Cham, Springer Nature, 2019. ISBN 9783030147716. IX + 92 pages. Natural Language Engineering.
Pub Date: 2023-03-16 | DOI: 10.1017/s1351324923000086
Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón, Otto Tarkka
In this paper, we study natural language paraphrasing from both corpus creation and modeling points of view. We focus in particular on a methodology that allows the extraction of challenging examples of paraphrase pairs in their natural textual context, leading to a dataset potentially more suitable for evaluating the models' ability to represent meaning, especially in document context, than those gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first large-scale, fully manually annotated corpus of paraphrases in Finnish. The corpus contains 104,645 manually labeled paraphrase pairs, of which 98% are verified to be true paraphrases, either universally or within their present context. In order to control the diversity of the paraphrase pairs and avoid certain biases easily introduced in automatic candidate extraction, the paraphrases are manually collected from different paraphrase-rich text sources. This allows us to create a challenging dataset including longer and more lexically diverse paraphrases than can be expected from those collected through heuristics. In addition to quality, this also allows us to keep the original document context for each pair, making it possible to study paraphrasing in context. To our knowledge, this is the first paraphrase corpus which provides the original document context for the annotated pairs. We also study several paraphrase models trained and evaluated on the new data. Our initial paraphrase classification experiments indicate the challenging nature of the dataset when classifying using the detailed labeling scheme used in the corpus annotation, with accuracy substantially lagging behind human performance. However, when evaluating the models on a large-scale paraphrase retrieval task over almost 400M candidate sentences, the results are highly encouraging, with 29-53% of the pairs ranked in the top 10, depending on the paraphrase type.
The Turku Paraphrase Corpus is available at github.com/TurkuNLP/Turku-paraphrase-corpus as well as through the HuggingFace datasets library, under the CC-BY-SA license.
"Towards diverse and contextually anchored paraphrase modeling: A dataset and baselines for Finnish." Natural Language Engineering.
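The retrieval evaluation described above — for each query sentence, rank a large candidate pool by vector similarity and check whether the true paraphrase lands in the top 10 — can be sketched as follows. The random embeddings and the planted gold pairs are stand-ins for whatever sentence encoder and candidate pool are actually used; this is not the authors' evaluation code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy sentence embeddings: 5 queries, 1000 candidates; candidate i is the
# gold paraphrase of query i, planted near the query vector with small noise.
dim, n_candidates = 32, 1000
queries = rng.normal(size=(5, dim))
candidates = rng.normal(size=(n_candidates, dim))
for i in range(5):
    candidates[i] = queries[i] + 0.1 * rng.normal(size=dim)  # gold pair

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Cosine similarity of every query against every candidate, then a top-10 check:
# the fraction of queries whose gold paraphrase ranks among the 10 nearest.
sims = normalize(queries) @ normalize(candidates).T
top10 = np.argsort(-sims, axis=1)[:, :10]
hit_at_10 = np.mean([i in top10[i] for i in range(5)])
```

The reported 29-53% top-10 figures correspond to `hit_at_10` computed per paraphrase type, over a vastly larger candidate pool.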
Pub Date: 2023-02-28 | DOI: 10.1017/S1351324923000062
Henrique Lopes Cardoso, R. Sousa-Silva, Paula Carvalho, Bruno Martins
Abstract The study of argumentation is transversal to several research domains, from philosophy to linguistics, from the law to computer science and artificial intelligence. In discourse analysis, several distinct models have been proposed to harness argumentation, each with a different focus or aim. To analyze the use of argumentation in natural language, several corpus annotation efforts have been carried out, grounded more or less explicitly in one of these theoretical argumentation models. In fact, given the recent growing interest in argument mining applications, argument-annotated corpora are crucial for training machine learning models in a supervised way. However, the proliferation of such corpora has led to a wide disparity in the granularity of the argument annotations employed. In this paper, we review the most relevant theoretical argumentation models, after which we survey argument annotation projects closely following those theoretical models. We also highlight the main simplifications that are often introduced in practice. Furthermore, we briefly review other annotation efforts that are not so theoretically grounded but instead follow a shallower approach. It turns out that most argument annotation projects make their own assumptions and simplifications, both in terms of the textual genre they focus on and in terms of adapting the adopted theoretical argumentation model for their own agenda. Issues of compatibility among argument-annotated corpora are discussed by looking at the problem from a syntactic, semantic, and practical perspective.
{"title":"Argumentation models and their use in corpus annotation: Practice, prospects, and challenges","authors":"Henrique Lopes Cardoso, R. Sousa-Silva, Paula Carvalho, Bruno Martins","doi":"10.1017/S1351324923000062","journal":"Natural Language Engineering","volume":"29 1","pages":"1150 - 1187","publicationDate":"2023-02-28","publicationTypes":"Journal Article"}
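The disparity in annotation granularity that this survey discusses can be made concrete with a minimal data structure for an argument-annotated text: component spans carrying a label, plus directed relations between them. This is an illustrative sketch only, not the schema of any specific corpus; all class and label names here are hypothetical, and real projects distinguish many more component types (major claims, backings, rebuttals, etc.) at varying granularities.

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentUnit:
    # A character span annotated as an argumentative component.
    # Corpora differ in granularity: some mark only "claim" vs "premise",
    # others use richer inventories drawn from a theoretical model.
    start: int
    end: int
    label: str

@dataclass
class ArgumentRelation:
    # A directed link between two units (indices into the units list),
    # e.g. a premise supporting or attacking a claim.
    source: int
    target: int
    kind: str

@dataclass
class AnnotatedText:
    text: str
    units: list = field(default_factory=list)
    relations: list = field(default_factory=list)

# A toy two-unit annotation: a premise supporting a claim.
doc = AnnotatedText(
    text="We should ban cars downtown. Air quality has worsened.",
    units=[
        ArgumentUnit(0, 28, "claim"),
        ArgumentUnit(29, 54, "premise"),
    ],
    relations=[ArgumentRelation(source=1, target=0, kind="support")],
)
```

Compatibility questions between corpora then become questions about mapping one label inventory and relation set onto another, which is rarely lossless.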
Pub Date : 2023-02-21DOI: 10.1017/S1351324923000050
Huizhe Su, Hao Wang, Xiangfeng Luo, Shaorong Xie
Abstract In recent years, the extraction of overlapping relations has received great attention in the field of natural language processing (NLP). However, most existing approaches treat the relational triples in a sentence as isolated, without considering the rich semantic correlations implied in the relational hierarchy. Extracting these overlapping relational triples is challenging, given that the overlap types are varied and relatively complex. In addition, these approaches do not highlight the semantic information in the sentence from coarse-grained to fine-grained levels. In this paper, we propose an end-to-end neural framework based on a decomposition model that incorporates multi-granularity relational features for the extraction of overlapping triples. Our approach employs an attention mechanism that combines relational hierarchy information at multiple granularities with pretrained textual representations, where the relational hierarchies are constructed manually or obtained by unsupervised clustering. We found that the different hierarchy construction strategies have little effect on the final extraction results. Experimental results on two public datasets, NYT and WebNLG, show that our model substantially outperforms baseline systems in extracting overlapping relational triples, especially for long-tailed relations.
{"title":"An end-to-end neural framework using coarse-to-fine-grained attention for overlapping relational triple extraction","authors":"Huizhe Su, Hao Wang, Xiangfeng Luo, Shaorong Xie","doi":"10.1017/S1351324923000050","journal":"Natural Language Engineering","volume":"29 1","pages":"1126 - 1149","publicationDate":"2023-02-21","publicationTypes":"Journal Article"}
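The overlap phenomenon and the decomposition idea behind such models can be illustrated with a toy example: two triples share a subject entity, so a sentence-level extractor that treats triples as isolated would miss one of them. A decomposition-style approach instead detects subjects first and then tags (relation, object) pairs per subject. This is a hand-written illustration of the decomposition pattern only, not the authors' neural model; the detector and tagger functions below are hypothetical rule-based stand-ins for learned components.

```python
# One sentence yielding two triples that overlap in the subject "Obama"
# (the SingleEntityOverlap case in the relation extraction literature).
sentence = "Obama was born in Honolulu and graduated from Columbia University."

def extract_triples(sentence, subject_detector, relation_object_tagger):
    # Stage 1: detect candidate subject entities.
    # Stage 2: for each subject, tag its (relation, object) pairs.
    triples = []
    for subj in subject_detector(sentence):
        for rel, obj in relation_object_tagger(sentence, subj):
            triples.append((subj, rel, obj))
    return triples

# Hypothetical rule-based stand-ins for the learned components:
def subject_detector(sent):
    return ["Obama"] if "Obama" in sent else []

def relation_object_tagger(sent, subj):
    pairs = []
    if "born in" in sent:
        pairs.append(("born_in", "Honolulu"))
    if "graduated from" in sent:
        pairs.append(("graduated_from", "Columbia University"))
    return pairs

triples = extract_triples(sentence, subject_detector, relation_object_tagger)
# Both extracted triples share the subject entity, i.e. they overlap.
```

In the neural setting, both stages are taggers over contextual token representations, and the paper's contribution lies in enriching those representations with coarse-to-fine relational hierarchy features via attention.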