
Latest publications in Natural Language Engineering

An unsupervised perplexity-based method for boilerplate removal
IF 2.5 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-02-21 | DOI: 10.1017/s1351324923000049
Marcos Fernández-Pichel, Manuel Prada-Corral, D. Losada, J. C. Pichel, Pablo Gamallo
The availability of large web-based corpora has led to significant advances in a wide range of technologies, including massive retrieval systems or deep neural networks. However, leveraging this data is challenging, since web content is plagued by the so-called boilerplate: ads, incomplete or noisy text, and remnants of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extract useful and well-formed content from web-scraped data. Our approach takes advantage of Language Models and their implicit knowledge about correctly formed text, and we demonstrate here that perplexity is a valuable artefact that can contribute in terms of effectiveness and efficiency. As a matter of fact, the removal of noisy parts leads to lighter AI or search solutions that are effective and entail important reductions in resources spent. We exemplify here the usefulness of our method with two downstream tasks, search and classification, and a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approach.
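The filtering signal the abstract describes can be sketched with a toy add-one-smoothed unigram model standing in for the pre-trained models shipped with the authors' Python package; the corpus, segments, and threshold below are invented for illustration, not the paper's configuration.

```python
import math
from collections import Counter

def train_unigram_lm(corpus_tokens, smoothing=1.0):
    """Fit an add-one-smoothed unigram model on well-formed text."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot of probability mass for unseen tokens
    def prob(token):
        return (counts.get(token, 0) + smoothing) / (total + smoothing * vocab)
    return prob

def perplexity(prob, tokens):
    """Perplexity of a token sequence under the unigram model."""
    if not tokens:
        return float("inf")
    log_prob = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_prob / len(tokens))

def strip_boilerplate(segments, prob, threshold):
    """Keep only segments that look like fluent text (low perplexity)."""
    return [s for s in segments if perplexity(prob, s.split()) < threshold]

# Stand-in for a corpus of clean, well-formed text.
clean = "the quick brown fox jumps over the lazy dog and runs home".split()
lm = train_unigram_lm(clean)

segments = [
    "the fox runs over the dog",         # fluent: all tokens seen in training
    "Home | Login | Sitemap | Cookies",  # navigation-bar boilerplate
]
kept = strip_boilerplate(segments, lm, threshold=20.0)  # illustrative threshold
```

With a real language model trained on curated text, the same filter generalizes far better; the threshold would typically be tuned on held-out clean pages.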
Citations: 0
How to do human evaluation: A brief introduction to user studies in NLP
IF 2.5 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-02-06 | DOI: 10.1017/S1351324922000535
Hendrik Schuff, Lindsey Vanderlyn, Heike Adel, Ngoc Thang Vu
Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard to get started for researchers without prior experience in the field of human evaluation. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, (ii) discuss the particularities of user studies in NLP, and provide starting points to select questionnaires, experimental designs, and evaluation methods that are tailored to the specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code, to bridge the gap between theoretical guidelines and practical applications.
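As a flavour of the kind of statistical evaluation code the abstract mentions, here is a minimal percentile-bootstrap confidence interval for the difference in mean Likert ratings between two systems; the ratings and sample sizes are invented, and a real study would also report effect sizes and check test assumptions.

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the difference in mean ratings (A minus B)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample_a = [rng.choice(scores_a) for _ in scores_a]
        sample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical 1-5 Likert ratings from ten participants per system.
system_a = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]
system_b = [3, 2, 3, 4, 2, 3, 3, 2, 4, 3]
lo, hi = bootstrap_diff_ci(system_a, system_b)
# An interval that excludes 0 indicates a significant difference at the 5% level.
```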
Citations: 3
NLP startup funding in 2022
IF 2.5 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-01-01 | DOI: 10.1017/S1351324923000013
R. Dale
It’s no secret that the commercial application of NLP technologies has exploded in recent years. From chatbots and virtual assistants to machine translation and sentiment analysis, NLP technologies are now being used in a wide variety of applications across a range of industries. With the increasing demand for technologies that can process human language, investors have been eager to get a piece of the action. In this article, we look at NLP startup funding over the past year, identifying the applications and domains that have received investment.
Citations: 1
SEN: A subword-based ensemble network for Chinese historical entity extraction
IF 2.5 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-12-22 | DOI: 10.1017/S1351324922000493
Cheng Yan, Ruojiang Wang, Xiaoke Fang
Understanding various historical entity information (e.g., persons, locations, and time) plays a very important role in reasoning about the development of historical events. With increasing interest in the fields of digital humanities and natural language processing, named entity recognition (NER) provides a feasible solution for automatically extracting these entities from historical texts, especially in Chinese historical research. However, previous approaches are domain-specific, ineffective, with relatively low accuracy, and non-interpretable, which hinders the development of NER in Chinese history. In this paper, we propose a new hybrid deep learning model called “subword-based ensemble network” (SEN), incorporating subword information and a novel attention fusion mechanism. Experiments on a massive self-built Chinese historical corpus, CMAG, show that SEN achieves the best results, with 93.87% for F1-micro and 89.70% for F1-macro, compared with other advanced models. Further investigation reveals that SEN has a strong generalization ability for NER on Chinese historical texts: it is not only relatively insensitive to categories with fewer annotation labels (e.g., OFI) but can also accurately capture diverse local and global semantic relations. Our research demonstrates the effectiveness of integrating subword information and attention fusion, which provides an inspiring solution for the practical use of entity extraction in the Chinese historical domain.
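The gap between the reported F1-micro (93.87%) and F1-macro (89.70%) comes from how the two averages weight rare categories such as OFI. A small sketch with hypothetical per-category counts (not the paper's data) makes the difference concrete:

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical per-category (tp, fp, fn) counts: two frequent entity types
# and one rare type with few annotation labels, like OFI in the paper.
counts = {"PER": (900, 50, 60), "LOC": (800, 70, 40), "OFI": (30, 10, 20)}

micro = f1(sum(c[0] for c in counts.values()),
           sum(c[1] for c in counts.values()),
           sum(c[2] for c in counts.values()))
macro = sum(f1(*c) for c in counts.values()) / len(counts)
# micro pools all counts, so the rare OFI class barely moves it;
# macro averages per-class scores, so weak OFI performance drags it down.
```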
Citations: 0
KLAUS-Tr: Knowledge & learning-based unit focused arithmetic word problem solver for transfer cases
IF 2.5 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-12-22 | DOI: 10.1017/s1351324922000511
Suresh Kumar, P. S. Kumar
Solving Arithmetic Word Problems (AWPs) using AI techniques has attracted much attention in recent years. We feel that current AWP solvers under-utilize the relevant domain knowledge. We present a knowledge- and learning-based system that effectively solves AWPs of a specific type: those that involve the transfer of objects from one agent to another (Transfer Cases (TC)). We represent the knowledge relevant to these problems as a TC Ontology. The sentences in TC-AWPs contain information of essentially four types: before-transfer, transfer, after-transfer, and query. Our system (KLAUS-Tr) uses a statistical classifier to recognize the types of sentences. The sentence types guide the information extraction process used to identify the agents, quantities, units, types of objects, and the direction of transfer from the AWP text. The extracted information is represented as an RDF graph that uses the TC Ontology terminology. To solve the given AWP, we use semantic web rule language (SWRL) rules that capture the knowledge about how an object transfer affects the RDF graph of the AWP. Using the TC Ontology, we also analyze whether the given problem is consistent. The different ways in which TC-AWPs can be inconsistent are encoded as SWRL rules. Thus, KLAUS-Tr can identify whether a given AWP is invalid and notify the user accordingly. Since the existing datasets do not contain inconsistent AWPs, we create AWPs of this type and augment the datasets. We have implemented KLAUS-Tr and tested it on TC-type AWPs drawn from All-Arith and other datasets. We find that TC-AWPs constitute about 40% of the AWPs in a typical dataset like All-Arith. Our system achieves an impressive accuracy of 92%, significantly improving on the state of the art. We plan to extend the system to handle AWPs that contain multiple transfers of objects and to offer explanations of the solutions.
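KLAUS-Tr encodes the transfer semantics as SWRL rules over an RDF graph built with the TC Ontology; the pure-Python sketch below mimics only the effect of one such rule (quantity update plus a consistency check) on a hypothetical problem state, without any RDF or SWRL tooling.

```python
# Hypothetical TC-AWP: "Alice has 5 apples. Bob has 3 apples.
# Alice gives 2 apples to Bob. How many apples does Bob have now?"
# The state maps (agent, object-type) to a quantity, standing in for
# the before-transfer facts in the paper's RDF graph.
state = {("Alice", "apples"): 5, ("Bob", "apples"): 3}

def apply_transfer(state, giver, receiver, obj, qty):
    """Apply one object transfer, rejecting inconsistent problems
    (a giver cannot transfer more objects than they hold)."""
    if state.get((giver, obj), 0) < qty:
        raise ValueError(f"inconsistent AWP: {giver} holds too few {obj}")
    after = dict(state)  # keep the before-transfer state intact
    after[(giver, obj)] -= qty
    after[(receiver, obj)] = after.get((receiver, obj), 0) + qty
    return after

after = apply_transfer(state, "Alice", "Bob", "apples", 2)
answer = after[("Bob", "apples")]  # resolves the query sentence
```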
Citations: 0
Emerging trends: Unfair, biased, addictive, dangerous, deadly, and insanely profitable
IF 2.5 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-12-19 | DOI: 10.1017/s1351324922000481
Kenneth Ward Church, Annika Marie Schoene, John E. Ortega, Raman Chandrasekar, Valia Kordoni
There has been considerable work recently in the natural language community and elsewhere on Responsible AI. Much of this work focuses on fairness and biases (henceforth Risks 1.0), following the 2016 best seller: Weapons of Math Destruction. Two books published in 2022, The Chaos Machine and Like, Comment, Subscribe, raise additional risks to public health/safety/security such as genocide, insurrection, polarized politics, vaccinations (henceforth, Risks 2.0). These books suggest that the use of machine learning to maximize engagement in social media has created a Frankenstein Monster that is exploiting human weaknesses with persuasive technology, the illusory truth effect, Pavlovian conditioning, and Skinner’s intermittent variable reinforcement. Just as we cannot expect tobacco companies to sell fewer cigarettes and prioritize public health ahead of profits, so too, it may be asking too much of companies (and countries) to stop trafficking in misinformation given that it is so effective and so insanely profitable (at least in the short term). Eventually, we believe the current chaos will end, like the lawlessness in Wild West, because chaos is bad for business. As computer scientists, this paper will summarize criticisms from other fields and focus on implications for computer science; we will not attempt to contribute to those other fields. There is quite a bit of work in computer science on these risks, especially on Risks 1.0 (bias and fairness), but more work is needed, especially on Risks 2.0 (addictive, dangerous, and deadly).
Citations: 3
Parameter-efficient feature-based transfer for paraphrase identification
IF 2.5 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-12-19 | DOI: 10.1017/S135132492200050X
Xiaodong Liu, Rafal Rzepka, K. Araki
There are many types of approaches to Paraphrase Identification (PI), an NLP task of determining whether a sentence pair has equivalent semantics. Traditional approaches mainly consist of unsupervised learning and feature engineering, which are computationally inexpensive. However, their task performance is moderate nowadays. To seek a method that preserves the low computational costs of traditional approaches but yields better task performance, we investigate neural network-based transfer learning approaches. We discover that our research goal can be accomplished by using parameters more efficiently in feature-based transfer. To this end, we propose a pre-trained task-specific architecture. The fixed parameters of the pre-trained architecture can be shared by multiple classifiers with small additional parameters. As a result, the only computational cost involving parameter updates comes from classifier tuning: the features output by the architecture, combined with lexical overlap features, are fed into a single classifier for tuning. Furthermore, the pre-trained task-specific architecture can be applied to natural language inference and semantic textual similarity tasks as well. Such technical novelty leads to slight consumption of computational and memory resources for each task and is also conducive to power-efficient continual learning. The experimental results show that our proposed method is competitive with adapter-BERT (a parameter-efficient fine-tuning approach) on some tasks while consuming only 16% of the trainable parameters and saving 69-96% of the time for parameter updates.
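The recipe, reduced to its skeleton: a fixed feature extractor is shared across tasks and only a small classifier head is tuned. Below, hand-crafted lexical-overlap features stand in for the frozen pre-trained architecture, and the sentence pairs are invented; this sketches the classifier-tuning step only, not the authors' model.

```python
import math

def fixed_features(s1, s2):
    """Frozen feature extractor: Jaccard token overlap, length ratio, bias.
    Nothing here is ever updated during training."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    jaccard = len(t1 & t2) / len(t1 | t2)
    ratio = min(len(t1), len(t2)) / max(len(t1), len(t2))
    return [jaccard, ratio, 1.0]

def train_head(pairs, labels, epochs=500, lr=1.0):
    """Tune only a tiny logistic-regression head on top of the fixed features."""
    w = [0.0, 0.0, 0.0]
    data = [fixed_features(a, b) for a, b in pairs]
    for _ in range(epochs):
        for x, y in zip(data, labels):
            p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def is_paraphrase(w, s1, s2):
    x = fixed_features(s1, s2)
    return sum(wi * xi for wi, xi in zip(w, x)) > 0.0  # logit > 0 means p > 0.5

# Invented training pairs: two paraphrases, two unrelated pairs.
pairs = [
    ("the cat sat on the mat", "the cat is on the mat"),
    ("he bought a new car", "he purchased a brand new car"),
    ("the sky is blue", "bananas are rich in potassium"),
    ("she plays the piano", "the stock market fell sharply"),
]
labels = [1, 1, 0, 0]
w = train_head(pairs, labels)
```

Because the extractor is frozen, several tasks could share it and each would add only the three head weights as trainable parameters.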
Citations: 0
NLE volume 28 issue 6 Cover and Front matter
IF 2.5 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-11-01 | DOI: 10.1017/s1351324922000468
R. Mitkov, B. Boguraev
whether translation, computer science or engineering. Its aim is to bridge the gap between computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing original research articles on a broad range of topics - from text analysis, machine translation, information retrieval, speech processing and generation to integrated systems and multi-modal interfaces - it also publishes special issues on specific natural language processing methods, tasks or applications. The journal welcomes survey papers describing the state of the art of a specific topic. The Journal of Natural Language Engineering also publishes the popular Industry Watch and Emerging Trends columns as well as book reviews.
Citations: 0
NLE volume 28 issue 6 Cover and Back matter
IF 2.5 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2022-11-01 DOI: 10.1017/s135132492200047x
{"title":"NLE volume 28 issue 6 Cover and Back matter","authors":"","doi":"10.1017/s135132492200047x","DOIUrl":"https://doi.org/10.1017/s135132492200047x","url":null,"abstract":"","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"28 1","pages":"b1 - b2"},"PeriodicalIF":2.5,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41966861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Towards universal methods for fake news detection
IF 2.5 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2022-10-26 DOI: 10.1017/s1351324922000456
M. Pszona, M. Janicka, Grzegorz Wojdyga, A. Wawer
Abstract Fake news detection is an emerging topic that has attracted a lot of attention among researchers and in the industry. This paper focuses on fake news detection as a text classification problem: on the basis of five publicly available corpora with documents labeled as true or fake, the task was to automatically distinguish both classes without relying on fact-checking. The aim of our research was to test the feasibility of a universal model: one that produces satisfactory results on all data sets tested in our article. We attempted to do so by training a set of classification models on one collection and testing them on another. As it turned out, this resulted in a sharp performance degradation. Therefore, this paper focuses on finding the most effective approach to utilizing information in a transferable manner. We examined a variety of methods: feature selection, machine learning approaches to data set shift (instance re-weighting and projection-based), and deep learning approaches based on domain transfer. These methods were applied to various feature spaces: linguistic and psycholinguistic, embeddings obtained from the Universal Sentence Encoder, and GloVe embeddings. A detailed analysis showed that some combinations of these methods and selected feature spaces bring significant improvements. When using linguistic data, feature selection yielded the best overall mean improvement (across all train-test pairs) of 4%. Among the domain adaptation methods, the greatest improvement of 3% was achieved by subspace alignment.
{"title":"Towards universal methods for fake news detection","authors":"M. Pszona, M. Janicka, Grzegorz Wojdyga, A. Wawer","doi":"10.1017/s1351324922000456","DOIUrl":"https://doi.org/10.1017/s1351324922000456","url":null,"abstract":"Abstract Fake news detection is an emerging topic that has attracted a lot of attention among researchers and in the industry. This paper focuses on fake news detection as a text classification problem: on the basis of five publicly available corpora with documents labeled as true or fake, the task was to automatically distinguish both classes without relying on fact-checking. The aim of our research was to test the feasibility of a universal model: one that produces satisfactory results on all data sets tested in our article. We attempted to do so by training a set of classification models on one collection and testing them on another. As it turned out, this resulted in a sharp performance degradation. Therefore, this paper focuses on finding the most effective approach to utilizing information in a transferable manner. We examined a variety of methods: feature selection, machine learning approaches to data set shift (instance re-weighting and projection-based), and deep learning approaches based on domain transfer. These methods were applied to various feature spaces: linguistic and psycholinguistic, embeddings obtained from the Universal Sentence Encoder, and GloVe embeddings. A detailed analysis showed that some combinations of these methods and selected feature spaces bring significant improvements. When using linguistic data, feature selection yielded the best overall mean improvement (across all train-test pairs) of 4%. Among the domain adaptation methods, the greatest improvement of 3% was achieved by subspace alignment.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1004 - 1042"},"PeriodicalIF":2.5,"publicationDate":"2022-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45906028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
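The abstract above reports that subspace alignment gave the largest domain-adaptation gain. A minimal sketch of that technique follows; it assumes the standard PCA-based formulation (align the source subspace basis to the target basis before training), and the function name, dimensionality `d`, and data shapes are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(Xs, Xt, d=10):
    """Project source and target features into a shared low-dimensional
    space by aligning the source PCA basis to the target PCA basis."""
    pca_s = PCA(n_components=d).fit(Xs)
    pca_t = PCA(n_components=d).fit(Xt)
    Ps = pca_s.components_.T            # (D, d) source subspace basis
    Pt = pca_t.components_.T            # (D, d) target subspace basis
    M = Ps.T @ Pt                       # (d, d) alignment matrix
    # Center each domain, then project: source through its aligned basis,
    # target through its own basis.
    Xs_aligned = (Xs - Xs.mean(axis=0)) @ Ps @ M
    Xt_proj = (Xt - Xt.mean(axis=0)) @ Pt
    return Xs_aligned, Xt_proj
```

A classifier trained on `Xs_aligned` (e.g. on one fake-news corpus) can then be evaluated on `Xt_proj` (another corpus), since both now live in the target's d-dimensional coordinate frame.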