Cluster-based ensemble learning model for improving sentiment classification of Arabic documents
Pub Date: 2023-06-01 | DOI: 10.1017/s135132492300027x
Rana Husni Al Mahmoud, B. Hammo, Hossam Faris
This article reports on designing and implementing a multiclass sentiment classification approach to handle the imbalanced class distribution of Arabic documents. The proposed approach, sentiment classification of Arabic documents (SCArD), combines the advantages of a clustering-based undersampling (CBUS) method and an ensemble learning model to aid machine learning (ML) classifiers in building accurate models on highly imbalanced datasets. The CBUS method applies two standard clustering algorithms, K-means and expectation–maximization, to balance the ratio between the majority and minority classes by decreasing the number of majority-class instances while maintaining the number of minority-class instances at the cluster level. The merit of the proposed approach is that it neither removes majority-class instances from the dataset nor injects artificial minority-class instances into it. The resulting balanced datasets are used to train two ML classifiers, random forest and updateable Naïve Bayes, to develop prediction models. The best prediction models are selected based on their F1 scores. We applied two techniques to test SCArD and generate new predictions from the imbalanced test dataset. The first technique uses the best prediction models directly. The second uses a majority-voting ensemble, which combines the best prediction models to generate the final predictions. The experimental results showed that SCArD is promising and outperformed the comparative classification models in terms of F1 score.
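To make the CBUS idea concrete, here is a minimal sketch of cluster-level undersampling with K-means. The cluster count, the per-cluster allocation, and the function name are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_undersample(X_major, n_minor, n_clusters=10, seed=0):
    """Shrink the majority class at the cluster level until its size
    roughly matches the minority-class size (n_minor)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_major)
    rng = np.random.default_rng(seed)
    per_cluster = max(1, n_minor // n_clusters)  # assumed even allocation
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        take = min(len(members), per_cluster)
        if take:
            keep.extend(rng.choice(members, size=take, replace=False))
    return X_major[np.sort(np.array(keep, dtype=int))]
```

Sampling within clusters, rather than from the whole majority class, preserves the variety of majority-class regions while reducing their count.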
{"title":"Cluster-based ensemble learning model for improving sentiment classification of Arabic documents","authors":"Rana Husni Al Mahmoud, B. Hammo, Hossam Faris","doi":"10.1017/s135132492300027x","DOIUrl":"https://doi.org/10.1017/s135132492300027x","url":null,"abstract":"\u0000 This article reports on designing and implementing a multiclass sentiment classification approach to handle the imbalanced class distribution of Arabic documents. The proposed approach, sentiment classification of Arabic documents (SCArD), combines the advantages of a clustering-based undersampling (CBUS) method and an ensemble learning model to aid machine learning (ML) classifiers in building accurate models against highly imbalanced datasets. The CBUS method applies two standard clustering algorithms: K-means and expectation–maximization, to balance the ratio between the major and the minor classes by decreasing the number of the major class instances and maintaining the number of the minor class instances at the cluster level. The merits of the proposed approach are that it does not remove the majority class instances from the dataset nor injects the dataset with artificial minority class instances. The resulting balanced datasets are used to train two ML classifiers, random forest and updateable Naïve Bayes, to develop prediction data models. The best prediction data models are selected based on F1-score rates. We applied two techniques to test SCArD and generate new predictions from the imbalanced test dataset. The first technique uses the best prediction data models. The second technique uses the majority voting ensemble learning model, which combines the best prediction data models to generate the final predictions. The experimental results showed that SCArD is promising and outperformed the other comparative classification models based on the F1-score rates.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46482512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving semantic coverage of data-to-text generation model using dynamic memory networks
Pub Date: 2023-05-31 | DOI: 10.1017/s1351324923000207
Elham Seifossadat, H. Sameti
This paper proposes a sequence-to-sequence model for data-to-text generation, called DM-NLG, which generates natural language text from structured nonlinguistic input. Specifically, by adding a dynamic memory module to an attention-based sequence-to-sequence model, DM-NLG can store the information that led to the generation of previous output words and use it to generate the next word. In this way, the decoder is aware of all previous decisions, which prevents the generation of duplicate words or incomplete semantic concepts. To improve the quality of the sentences generated by the DM-NLG decoder, a postprocessing step using pretrained language models is performed. To demonstrate the effectiveness of DM-NLG, we performed experiments on five different datasets and observed that our proposed model reduces the slot error rate by 50% and improves BLEU by 10% compared to state-of-the-art models.
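As a rough illustration of a decoder that remembers what produced earlier words, here is a minimal PyTorch sketch in which the memory accumulates past attention contexts. The wiring, dimensions, and names are our assumptions based on the abstract, not the DM-NLG implementation.

```python
import torch
import torch.nn as nn

class MemoryDecoderStep(nn.Module):
    """One decoding step whose input is enriched with a summary of the
    contexts that generated all previous words (the 'dynamic memory')."""
    def __init__(self, hidden=256, vocab=32000):  # sizes are illustrative
        super().__init__()
        self.cell = nn.GRUCell(hidden * 2, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, y_prev, h, enc_ctx, memory):
        # memory: (B, t, hidden) store of past attention contexts
        mem = memory.mean(dim=1) if memory.size(1) > 0 else torch.zeros_like(h)
        h = self.cell(torch.cat([y_prev, enc_ctx + mem], dim=-1), h)
        memory = torch.cat([memory, enc_ctx.unsqueeze(1)], dim=1)  # remember this step
        return self.out(h), h, memory
```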
{"title":"Improving semantic coverage of data-to-text generation model using dynamic memory networks","authors":"Elham Seifossadat, H. Sameti","doi":"10.1017/s1351324923000207","DOIUrl":"https://doi.org/10.1017/s1351324923000207","url":null,"abstract":"\u0000 This paper proposes a sequence-to-sequence model for data-to-text generation, called DM-NLG, to generate a natural language text from structured nonlinguistic input. Specifically, by adding a dynamic memory module to the attention-based sequence-to-sequence model, it can store the information that leads to generate previous output words and use it to generate the next word. In this way, the decoder part of the model is aware of all previous decisions, and as a result, the generation of duplicate words or incomplete semantic concepts is prevented. To improve the generated sentences quality by the DM-NLG decoder, a postprocessing step is performed using the pretrained language models. To prove the effectiveness of the DM-NLG model, we performed experiments on five different datasets and observed that our proposed model is able to reduce the slot error rate rate by 50% and improve the BLEU by 10%, compared to the state-of-the-art models.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48272916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Urdu paraphrase detection: A novel DNN-based implementation using a semi-automatically generated corpus
Pub Date: 2023-05-29 | DOI: 10.1017/s1351324923000189
Hafiz Rizwan Iqbal, Rashad Maqsood, Agha Ali Raza, Saeed-Ul Hassan
Automatic paraphrase detection is the task of measuring the semantic overlap between two given texts. A major hurdle in the development and evaluation of paraphrase detection approaches, particularly for South Asian languages like Urdu, is the lack of standard evaluation resources. The few available paraphrase corpora for these languages were created manually. As a result, they are small and poorly suited for evaluating mainstream data-driven and deep neural network (DNN)-based approaches. Consequently, there is a need for semi- or fully automated corpus generation approaches for resource-scarce languages. There is currently no semi- or fully automatically generated sentence-level Urdu paraphrase corpus. Moreover, no study localizes and compares approaches for Urdu paraphrase detection across mainstream deep neural architectures and pretrained language models. This study addresses these gaps by presenting a semi-automatic pipeline for generating paraphrase corpora for Urdu, along with a corpus generated using the proposed approach. The corpus contains 3147 semi-automatically extracted Urdu sentence pairs, manually tagged as paraphrased (854) or non-paraphrased (2293). Finally, this paper proposes two novel DNN-based approaches for paraphrase detection in Urdu text: Word Embeddings n-gram Overlap (WENGO) and a modified approach, Deep Text Reuse and Paraphrase Plagiarism Detection (D-TRAPPD). Both approaches were evaluated on two related tasks: (i) paraphrase detection and (ii) text reuse and plagiarism detection. D-TRAPPD ($F_1 = 96.80$ for paraphrase detection and $F_1 = 88.90$ for text reuse and plagiarism detection) outperformed WENGO ($F_1 = 81.64$ and $F_1 = 61.19$, respectively) as well as other state-of-the-art approaches for these two tasks. The corpus, models, and our implementations are freely available to the research community.
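Since the abstract names Word Embeddings n-gram Overlap, here is a minimal sketch of what such a score could look like: n-gram overlap measured by cosine similarity in embedding space rather than by exact match. The pooling, threshold, and normalization are our assumptions, not the paper's definition.

```python
import numpy as np

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def wengo_like_score(s1, s2, emb, n=2, threshold=0.8):
    """emb: dict token -> vector. Fraction of s1 n-grams whose mean
    embedding is close (cosine >= threshold) to some s2 n-gram."""
    dim = len(next(iter(emb.values())))

    def vec(gram):
        vs = [emb[t] for t in gram if t in emb]
        v = np.mean(vs, axis=0) if vs else np.zeros(dim)
        return v / (np.linalg.norm(v) + 1e-9)

    g1 = [vec(g) for g in ngrams(s1, n)]
    g2 = [vec(g) for g in ngrams(s2, n)]
    if not g1 or not g2:
        return 0.0
    return sum(any(float(a @ b) >= threshold for b in g2) for a in g1) / len(g1)
```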
{"title":"Urdu paraphrase detection: A novel DNN-based implementation using a semi-automatically generated corpus","authors":"Hafiz Rizwan Iqbal, Rashad Maqsood, Agha Ali Raza, Saeed-Ul Hassan","doi":"10.1017/s1351324923000189","DOIUrl":"https://doi.org/10.1017/s1351324923000189","url":null,"abstract":"\u0000 Automatic paraphrase detection is the task of measuring the semantic overlap between two given texts. A major hurdle in the development and evaluation of paraphrase detection approaches, particularly for South Asian languages like Urdu, is the inadequacy of standard evaluation resources. The very few available paraphrased corpora for these languages are manually created. As a result, they are constrained to smaller sizes and are not very feasible to evaluate mainstream data-driven and deep neural networks (DNNs)-based approaches. Consequently, there is a need to develop semi- or fully automated corpus generation approaches for the resource-scarce languages. There is currently no semi- or fully automatically generated sentence-level Urdu paraphrase corpus. Moreover, no study is available to localize and compare approaches for Urdu paraphrase detection that focus on various mainstream deep neural architectures and pretrained language models.\u0000 This research study addresses this problem by presenting a semi-automatic pipeline for generating paraphrased corpora for Urdu. It also presents a corpus that is generated using the proposed approach. This corpus contains 3147 semi-automatically extracted Urdu sentence pairs that are manually tagged as paraphrased (854) and non-paraphrased (2293). Finally, this paper proposes two novel approaches based on DNNs for the task of paraphrase detection in Urdu text. These are Word Embeddings n-gram Overlap (henceforth called WENGO), and a modified approach, Deep Text Reuse and Paraphrase Plagiarism Detection (henceforth called D-TRAPPD). Both of these approaches have been evaluated on two related tasks: (i) paraphrase detection, and (ii) text reuse and plagiarism detection. The results from these evaluations revealed that D-TRAPPD (\u0000 \u0000 \u0000 \u0000$F_1 = 96.80$\u0000\u0000 \u0000 for paraphrase detection and \u0000 \u0000 \u0000 \u0000$F_1 = 88.90$\u0000\u0000 \u0000 for text reuse and plagiarism detection) outperformed WENGO (\u0000 \u0000 \u0000 \u0000$F_1 = 81.64$\u0000\u0000 \u0000 for paraphrase detection and \u0000 \u0000 \u0000 \u0000$F_1 = 61.19$\u0000\u0000 \u0000 for text reuse and plagiarism detection) as well as other state-of-the-art approaches for these two tasks. The corpus, models, and our implementations have been made available as free to download for the research community.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47763438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Plot extraction and the visualization of narrative flow
Pub Date: 2023-05-23 | DOI: 10.1017/s1351324923000232
Michael DeBuse, Sean Warnick
This article discusses the development of an automated plot extraction system for narrative texts. Acknowledging the distinction between plot, as an object of study with its own rich history and literature, and the features of a text that may be automatically extractable, we begin by characterizing a text’s scatter plot of entities. This visualization reveals entity density patterns that characterize the particular telling of the story under investigation and leads to effective scene partitioning. We then introduce the concept of narrative flow, a graph representation of the narrative ordering of scenes (the syuzhet) that captures how entities move through scenes, and investigate the degree to which narrative flow can be automatically extracted given a glossary of plot-important objects, actors, and locations. Our subsequent analysis explores the correlation between subjective notions of plot and the information extracted through these visualizations. In particular, we discuss narrative structures commonly found within the graphs and compare them with ground-truth narrative flow graphs; the mixed results highlight the difficulty of plot extraction. However, the visual artifacts and common structural relationships seen in the graphs provide insight into narrative and its underlying plot.
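As a toy illustration of the "scatter plot of entities" described above, the following sketch plots each glossary entity's mentions against token position. Exact-string matching and the plotting choices are simplifying assumptions.

```python
import matplotlib.pyplot as plt

def entity_scatter(tokens, entities):
    """Plot one point per entity mention at (token position, entity index)."""
    xs, ys = [], []
    for i, ent in enumerate(entities):
        for pos, tok in enumerate(tokens):
            if tok.lower() == ent.lower():
                xs.append(pos)
                ys.append(i)
    plt.scatter(xs, ys, s=8)
    plt.yticks(range(len(entities)), entities)
    plt.xlabel("token position")
    plt.title("Scatter plot of entities")
    plt.show()
```

Dense vertical bands in such a plot correspond to passages where many entities co-occur, which is what makes the visualization useful for scene partitioning.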
{"title":"Plot extraction and the visualization of narrative flow","authors":"Michael DeBuse, Sean Warnick","doi":"10.1017/s1351324923000232","DOIUrl":"https://doi.org/10.1017/s1351324923000232","url":null,"abstract":"\u0000 This article discusses the development of an automated plot extraction system for narrative texts. Acknowledging the distinction between plot, as an object of study with its own rich history and literature, and features of a text that may be automatically extractable, we begin by characterizing a text’s scatter plot of entities. This visualization of a text reveals entity density patterns characterizing the particular telling of the story under investigation and leads to effective scene partitioning. We then introduce the concept of narrative flow, a graph representation of the narrative ordering of scenes (the syuzhet) that includes how entities move through scenes from the text, and investigate the degree to which narrative flow can be automatically extracted given a glossary of plot-important objects, actors, and locations. Our subsequent analysis then explores the correlation between subjective notions of plot and the information extracted through these visualizations. In particular, we discuss narrative structures commonly found within the graphs and make comparisons with ground truth narrative flow graphs, showing mixed results highlighting the difficulty of plot extraction. However, the visual artifacts and common structural relationships seen in the graphs provide insight into narrative and its underlying plot.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47790134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recommending tasks based on search queries and missions
Pub Date: 2023-05-17 | DOI: 10.1017/s1351324923000219
Darío Garigliotti, K. Balog, K. Hose, Johannes Bjerva
Web search is an experience that naturally lends itself to recommendations, including query suggestions and related entities. In this article, we propose to recommend specific tasks to users based on their search queries, such as planning a holiday trip or organizing a party. Specifically, we introduce the problem of query-based task recommendation and develop methods that combine well-established term-based ranking techniques with continuous semantic representations, including sentence representations from several transformer-based models. Using a purpose-built test collection, we find that our method significantly outperforms a strong text-based baseline. Further, we extend our approach to take as input a set of queries that all share the same underlying task, referred to as a search mission. The study is rounded off with a detailed feature and query analysis.
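A minimal sketch of the hybrid ranking idea, interpolating a TF-IDF term score with a sentence-embedding score. The embedding model choice and the interpolation weight are our assumptions, not the article's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer  # assumed dependency

def rank_tasks(query, tasks, alpha=0.5):
    """Score each candidate task by a mix of term overlap and semantics."""
    tfidf = TfidfVectorizer().fit(tasks + [query])
    term = cosine_similarity(tfidf.transform([query]), tfidf.transform(tasks))[0]
    model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is ours
    sem = cosine_similarity(model.encode([query]), model.encode(tasks))[0]
    scores = alpha * term + (1 - alpha) * sem
    return sorted(zip(tasks, scores), key=lambda ts: -ts[1])
```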
{"title":"Recommending tasks based on search queries and missions","authors":"Darío Garigliotti, K. Balog, K. Hose, Johannes Bjerva","doi":"10.1017/s1351324923000219","DOIUrl":"https://doi.org/10.1017/s1351324923000219","url":null,"abstract":"\u0000 Web search is an experience that naturally lends itself to recommendations, including query suggestions and related entities. In this article, we propose to recommend specific tasks to users, based on their search queries, such as planning a holiday trip or organizing a party. Specifically, we introduce the problem of query-based task recommendation and develop methods that combine well-established term-based ranking techniques with continuous semantic representations, including sentence representations from several transformer-based models. Using a purpose-built test collection, we find that our method is able to significantly outperform a strong text-based baseline. Further, we extend our approach to using a set of queries that all share the same underlying task, referred to as search mission, as input. The study is rounded off with a detailed feature and query analysis.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44859495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint learning of text alignment and abstractive summarization for long documents via unbalanced optimal transport
Pub Date: 2023-05-15 | DOI: 10.1017/s1351324923000177
Xin Shen, Wai Lam, Shumin Ma, Huadong Wang
Recently, neural abstractive text summarization (NATS) models based on the sequence-to-sequence architecture have drawn a lot of attention. Real-world texts that need to be summarized range from short news articles with dozens of words to long reports with thousands of words. However, most existing NATS models are poor at summarizing long documents, due to the inherent limitations of their underlying neural architectures. In this paper, we focus on the task of long document summarization (LDS). Based on the inherent section structure of source documents, we divide an abstractive LDS problem into several smaller problems. In this setting, how to provide a less-biased target summary as the supervision for each section is vital to the model’s performance. As a preliminary, we formally describe the section-to-summary-sentence (S2SS) alignment for LDS. Based on this, we propose a novel NATS framework for the LDS task, built on the theory of unbalanced optimal transport (UOT) and named UOTSumm. It jointly learns three targets in a unified training objective: the optimal S2SS alignment, a section-level NATS summarizer, and the number of aligned summary sentences for each section. In this way, UOTSumm learns the text alignment directly from summarization data, without resorting to any biased tool such as ROUGE. UOTSumm can be easily adapted to most existing NATS models, and we implement two versions of it, with and without the pretrain-finetune technique. We evaluate UOTSumm on three publicly available LDS benchmarks: PubMed, arXiv, and GovReport. UOTSumm clearly outperforms its counterparts that use ROUGE for text alignment. When combined with UOTSumm, the performance of two vanilla NATS models improves by a large margin. Moreover, UOTSumm achieves better or comparable performance compared with recent strong baselines.
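To give a flavour of S2SS alignment via unbalanced optimal transport, here is a sketch using the POT library with a cosine-distance cost. The uniform masses and regularization values are assumptions, and the real UOTSumm objective is learned jointly with the summarizer rather than computed post hoc.

```python
import numpy as np
import ot  # POT: pip install pot

def s2ss_alignment(section_vecs, sentence_vecs, reg=0.05, reg_m=1.0):
    """Align summary sentences to sections with an unbalanced OT plan."""
    S = np.asarray(section_vecs)    # (n_sections, d), assumed unit-normalized
    T = np.asarray(sentence_vecs)   # (n_sentences, d), assumed unit-normalized
    cost = 1.0 - S @ T.T            # cosine distance
    a = np.full(len(S), 1.0 / len(S))   # uniform section mass (assumption)
    b = np.full(len(T), 1.0 / len(T))   # uniform sentence mass (assumption)
    plan = ot.unbalanced.sinkhorn_unbalanced(a, b, cost, reg, reg_m)
    return plan.argmax(axis=0)      # most-aligned section per summary sentence
```

The "unbalanced" relaxation matters here because sections and summary sentences need not carry matching total mass: some sections contribute several summary sentences and others none.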
{"title":"Joint learning of text alignment and abstractive summarization for long documents via unbalanced optimal transport","authors":"Xin Shen, Wai Lam, Shumin Ma, Huadong Wang","doi":"10.1017/s1351324923000177","DOIUrl":"https://doi.org/10.1017/s1351324923000177","url":null,"abstract":"\u0000 Recently, neural abstractive text summarization (NATS) models based on sequence-to-sequence architecture have drawn a lot of attention. Real-world texts that need to be summarized range from short news with dozens of words to long reports with thousands of words. However, most existing NATS models are not good at summarizing long documents, due to the inherent limitations of their underlying neural architectures. In this paper, we focus on the task of long document summarization (LDS). Based on the inherent section structures of source documents, we divide an abstractive LDS problem into several smaller-sized problems. In this circumstance, how to provide a less-biased target summary as the supervision for each section is vital for the model’s performance. As a preliminary, we formally describe the section-to-summary-sentence (S2SS) alignment for LDS. Based on this, we propose a novel NATS framework for the LDS task. Our framework is built based on the theory of unbalanced optimal transport (UOT), and it is named as UOTSumm. It jointly learns three targets in a unified training objective, including the optimal S2SS alignment, a section-level NATS summarizer, and the number of aligned summary sentences for each section. In this way, UOTSumm directly learns the text alignment from summarization data, without resorting to any biased tool such as ROUGE. UOTSumm can be easily adapted to most existing NATS models. And we implement two versions of UOTSumm, with and without the pretrain-finetune technique. We evaluate UOTSumm on three publicly available LDS benchmarks: PubMed, arXiv, and GovReport. UOTSumm obviously outperforms its counterparts that use ROUGE for the text alignment. When combined with UOTSumm, the performance of two vanilla NATS models improves by a large margin. Besides, UOTSumm achieves better or comparable performance when compared with some recent strong baselines.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"1 1","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41342362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Determining sentiment views of verbal multiword expressions using linguistic features
Pub Date: 2023-05-15 | DOI: 10.1017/s1351324923000153
Michael Wiegand, Marc Schulder, Josef Ruppenhofer
We examine the binary classification of sentiment views for verbal multiword expressions (MWEs). Sentiment views denote the perspective of the holder of some opinion. We distinguish between MWEs conveying the view of the speaker of the utterance (e.g., in “The company reinvented the wheel,” the holder is the implicit speaker, who criticizes the company for creating something that already exists) and MWEs conveying the view of explicit entities participating in an opinion event (e.g., in “Peter threw in the towel,” the holder is Peter, who has given up something). The task has so far been examined only on unigram opinion words. Since many features found effective for unigrams are not usable for MWEs, we propose novel ones that take into account the internal structure of MWEs, a unigram sentiment-view lexicon, and various information from Wiktionary. We also examine distributional methods and show that the corpus on which a representation is induced has a notable impact on the classification. We perform an extrinsic evaluation on the task of opinion holder extraction and show that the learnt knowledge also improves a state-of-the-art classifier trained on BERT. Sentiment-view classification is typically framed as a task in which only a small amount of labeled training data is available. As in the case of unigrams, we show that for MWEs a feature-based approach beats state-of-the-art generic methods.
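As an illustration of a feature-based setup for this task, the sketch below builds a few toy features (MWE length, lookups in a unigram sentiment-view lexicon) and trains a linear classifier. The feature set is our invention, loosely inspired by the abstract, not the authors' feature set.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def mwe_features(mwe_tokens, view_lexicon):
    """view_lexicon: dict unigram -> 'speaker' or 'actor' (assumed resource)."""
    return {
        "length": len(mwe_tokens),
        "has_speaker_word": any(view_lexicon.get(t) == "speaker" for t in mwe_tokens),
        "has_actor_word": any(view_lexicon.get(t) == "actor" for t in mwe_tokens),
        "head=" + mwe_tokens[0]: 1,  # crude proxy for internal structure
    }

def train_view_classifier(mwes, labels, view_lexicon):
    """mwes: list of token lists; labels: 'speaker' / 'actor' per MWE."""
    vec = DictVectorizer()
    X = vec.fit_transform([mwe_features(m, view_lexicon) for m in mwes])
    return vec, LogisticRegression(max_iter=1000).fit(X, labels)
```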
{"title":"Determining sentiment views of verbal multiword expressions using linguistic features","authors":"Michael Wiegand, Marc Schulder, Josef Ruppenhofer","doi":"10.1017/s1351324923000153","DOIUrl":"https://doi.org/10.1017/s1351324923000153","url":null,"abstract":"\u0000 We examine the binary classification of sentiment views for verbal multiword expressions (MWEs). Sentiment views denote the perspective of the holder of some opinion. We distinguish between MWEs conveying the view of the speaker of the utterance (e.g., in “The company reinvented the wheel” the holder is the implicit speaker who criticizes the company for creating something already existing) and MWEs conveying the view of explicit entities participating in an opinion event (e.g., in “Peter threw in the towel” the holder is Peter having given up something). The task has so far been examined on unigram opinion words. Since many features found effective for unigrams are not usable for MWEs, we propose novel ones taking into account the internal structure of MWEs, a unigram sentiment-view lexicon and various information from Wiktionary. We also examine distributional methods and show that the corpus on which a representation is induced has a notable impact on the classification. We perform an extrinsic evaluation in the task of opinion holder extraction and show that the learnt knowledge also improves a state-of-the-art classifier trained on BERT. Sentiment-view classification is typically framed as a task in which only little labeled training data are available. As in the case of unigrams, we show that for MWEs a feature-based approach beats state-of-the-art generic methods.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44157069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What should be encoded by position embedding for neural network language models?
Pub Date: 2023-05-10 | DOI: 10.1017/s1351324923000128
Shuiyuan Yu, Zihao Zhang, Haitao Liu
Word order is one of the most important grammatical devices and a basis for language understanding. However, the Transformer, one of the most popular NLP architectures, does not explicitly encode word order. A solution to this problem is to incorporate position information by means of position encoding/embedding (PE). Although a variety of methods for incorporating position information have been proposed, the NLP community still lacks detailed statistical research on position information in real-life language. To understand the influence of position information on the correlation between words in more detail, we investigated the factors that affect the frequency of words and word sequences in large corpora. Our results show that absolute position, relative position, being at one of the two ends of a sentence, and sentence length all significantly affect the frequency of words and word sequences. Moreover, we observed that the frequency distribution of word sequences over relative position carries valuable grammatical information. Our study suggests that to accurately capture word–word correlations, it is not enough to focus merely on absolute and relative position: Transformers should have access to more types of position-related information, which may require improvements to the current architecture.
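For reference, the absolute-position signal the abstract refers to is typically injected with the standard sinusoidal position encoding of Vaswani et al. (2017); a NumPy version is below.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Standard sinusoidal position encoding: sin on even dims, cos on odd."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (max_len, d_model)
```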
{"title":"What should be encoded by position embedding for neural network language models?","authors":"Shuiyuan Yu, Zihao Zhang, Haitao Liu","doi":"10.1017/s1351324923000128","DOIUrl":"https://doi.org/10.1017/s1351324923000128","url":null,"abstract":"\u0000 Word order is one of the most important grammatical devices and the basis for language understanding. However, as one of the most popular NLP architectures, Transformer does not explicitly encode word order. A solution to this problem is to incorporate position information by means of position encoding/embedding (PE). Although a variety of methods of incorporating position information have been proposed, the NLP community is still in want of detailed statistical researches on position information in real-life language. In order to understand the influence of position information on the correlation between words in more detail, we investigated the factors that affect the frequency of words and word sequences in large corpora. Our results show that absolute position, relative position, being at one of the two ends of a sentence and sentence length all significantly affect the frequency of words and word sequences. Besides, we observed that the frequency distribution of word sequences over relative position carries valuable grammatical information. Our study suggests that in order to accurately capture word–word correlations, it is not enough to focus merely on absolute and relative position. Transformers should have access to more types of position-related information which may require improvements to the current architecture.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42388065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SAN-T2T: An automated table-to-text generator based on selective attention network
Pub Date: 2023-05-05 | DOI: 10.1017/s135132492300013x
Haijie Ding, Xiaolong Xu
Table-to-text generation aims to generate descriptions for structured data (i.e., tables) and has been applied in many fields, such as question-answering systems and search engines. Current approaches mostly use neural language models to learn the alignment between output and input based on attention mechanisms, but they still suffer from the gradual weakening of attention when processing long texts and from an inability to utilize the records’ structural information. To solve these problems, we propose a novel generative model, SAN-T2T, which consists of a field-content selective encoder and a descriptive decoder, connected by a selective attention network. In the encoding phase, the table’s structure is integrated into its field representation, and a content selector with self-aligned gates exploits the fact that different records can determine each other’s importance. In the decoding phase, the content selector’s semantic information enhances the alignment between descriptions and records, and a featured copy mechanism addresses the rare-word problem. Experiments on the WikiBio and WeatherGov datasets show that SAN-T2T outperforms the baselines by a large margin, and that the content selector indeed improves the model’s performance.
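A minimal sketch of a self-aligned gating layer in the spirit of the content selector described above; the dimensions and exact wiring are our assumptions, not the SAN-T2T architecture.

```python
import torch
import torch.nn as nn

class ContentSelector(nn.Module):
    """Gate each record representation using its alignment to all other records."""
    def __init__(self, hidden=256):  # size is illustrative
        super().__init__()
        self.align = nn.Linear(hidden, hidden, bias=False)
        self.gate = nn.Linear(hidden * 2, hidden)

    def forward(self, records):  # records: (B, n_records, hidden)
        scores = records @ self.align(records).transpose(1, 2)  # record-record scores
        ctx = torch.softmax(scores, dim=-1) @ records           # self-aligned context
        g = torch.sigmoid(self.gate(torch.cat([records, ctx], dim=-1)))
        return g * records  # gated ("selected") records
```

Letting each record attend to all others is one way to realize the abstract's observation that records can determine each other's importance.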
{"title":"SAN-T2T: An automated table-to-text generator based on selective attention network","authors":"Haijie Ding, Xiaolong Xu","doi":"10.1017/s135132492300013x","DOIUrl":"https://doi.org/10.1017/s135132492300013x","url":null,"abstract":"\u0000 Table-to-text generation aims to generate descriptions for structured data (i.e., tables) and has been applied in many fields like question-answering systems and search engines. Current approaches mostly use neural language models to learn alignment between output and input based on the attention mechanisms, which are still flawed by the gradual weakening of attention when processing long texts and the inability to utilize the records’ structural information. To solve these problems, we propose a novel generative model SAN-T2T, which consists of a field-content selective encoder and a descriptive decoder, connected with a selective attention network. In the encoding phase, the table’s structure is integrated into its field representation, and a content selector with self-aligned gates is applied to take advantage of the fact that different records can determine each other’s importance. In the decoding phase, the content selector’s semantic information enhances the alignment between description and records, and a featured copy mechanism is applied to solve the rare word problem. Experiments on WikiBio and WeatherGov datasets show that SAN-T2T outperforms the baselines by a large margin, and the content selector indeed improves the model’s performance.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46933417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Obituary: Yorick Wilks
Pub Date: 2023-05-01 | DOI: 10.1017/s1351324923000256 | Natural Language Engineering 29, pp. 846–847
J. Tait
Yorick was a great friend of Natural Language Engineering. He was a member of the founding editorial board, but more to the point was a sage and encouraging advisor to the Founding Editors Roberto Garigliano, John Tait, and Branimir Boguraev right from the genesis of the project.

At the time of his death, Yorick was one of, if not the, doyen of computational linguists. He had been continuously active in the field since 1962. Having graduated in philosophy, he took up a position in Margaret Masterman’s Cambridge Language Research Unit, an eccentric and somewhat informal organisation which started the careers of many pioneers of artificial intelligence and natural language engineering, including Karen Spärck Jones, Martin Kay, Margaret Boden, and Roger Needham (thought by some to be the originator of machine learning, as well as much else in computing).

Yorick was awarded a PhD in 1968 for work on the use of interlingua in machine translation. His PhD thesis stands out not least for its bright yellow binding (Wilks, 1968). Wilks’ effective PhD supervisor was Margaret Masterman, a student of Wittgenstein’s, although his work was formally directed by the distinguished philosopher Richard Braithwaite, Masterman’s husband, as she lacked an appropriate established position in the University of Cambridge.

Inevitably, given the puny computers of the time, Yorick’s PhD work falls well short of the scientific standards of the 21st century. Despite its shortcomings, his pioneering work influenced many people who have ultimately contributed to the now widespread practical use of machine translation and other automatic language processing systems. In particular, it would be reasonable to surmise that the current success of deep learning systems is based on inferring or inducing a hidden interlingua of the sort Wilks and colleagues tried to handcraft in the 1960s and 1970s. Furthermore, all probabilistic language systems are based on selecting a better or more likely interpretation of a fragment of language over a less likely one, a development of the preference semantics notion originally invented and popularised by Wilks (1973, 1975). As a result, his early work continues to be worth studying, not least for the very deep insights careful reading often reveals.

Underlying this early work was an interest in metaphor, which Yorick recognised as a pervasive feature of language. This was a topic to which Yorick returned repeatedly throughout his life. Wilks (1978) began to develop his approach, with Barnden (2007) providing a useful summary of work to that date. However, there is much later work – for example, Wilks et al. (2013).

Wilks was an important figure in the attempt to utilise existing, published dictionaries as a knowledge source for automatic natural language processing systems (Wilks, Slator, and Guthrie, 1996). This endeavour ultimately foundered on the differing interests of commercial dictionary publishers and developers of natural language processing
{"title":"Obituary: Yorick Wilks","authors":"J. Tait","doi":"10.1017/s1351324923000256","DOIUrl":"https://doi.org/10.1017/s1351324923000256","url":null,"abstract":"Yorick was a great friend of Natural Language Engineering. He was a member of the founding editorial board, but more to the point was a sage and encouraging advisor to the Founding Editors Roberto Garigliano, John Tait, and Branimir Boguraev right from the genesis of the project. At the time of his death, Yorick was one of, if not the, doyen of computational linguists. He had been continuously active in the field since 1962. Having graduated in philosophy, he took up a position in Margaret Masterman’s Cambridge Language Research Unit, an eccentric and somewhat informal organisation which started the careers of many pioneers of artificial intelligence and natural language engineering including Karen Spärck Jones, Martin Kay, Margaret Boden, and Roger Needham (thought by some to be the originator of machine learning, as well as much else in computing). Yorick was awarded a PhD in 1968 for work on the use of interlingua in machine translation. His PhD thesis stands out not least for its bright yellow binding (Wilks, 1968). Wilks’ effective PhD supervisor was Margaret Masterman, a student of Wittgenstein’s, although his work was formally directed by the distinguished philosopher Richard Braithwaite, Masterman’s husband, as she lacked an appropriate established position in the University of Cambridge. Inevitably, given the puny computers of the time, Yorick’s PhD work falls well short of the scientific standards of the 21st Century. Despite its shortcomings, his pioneering work influenced many people who have ultimately contributed to the now widespread practical use of machine translation and other automatic language processing systems. In particular, it would be reasonable to surmise that the current success of deep learning systems is based on inferring or inducing a hidden interlingua of the sort Wilks and colleagues tried to handcraft in the 1960s and 1970s. Furthermore, all probabilistic language systems are based on selecting a better or more likely interpretation of a fragment of language over a less likely one, a development of the preference semantics notion originally invented and popularised byWillks (1973, 1975). As a result, his early work continues to be worth studying, not least for the very deep insights careful reading often reveals. Underlying this early work was an interest in metaphor, which Yorick recognised as a pervasive feature of language. This was a topic to which Yorick returned repeatedly throughout his life. Wilks (1978) began to develop his approach, with Barnden (2007) providing a useful summary of work to that date. However, there is much later work – for example Wilks et al. (2013). Wilks was an important figure in the attempt to utilise existing, published dictionaries as a knowledge source for automatic natural language processing systems (Wilks, Slator, and Guthrie, 1996). 
This endeavour ultimately foundered on the differing interests of commercial dictionary publishers and developers of natural language processing","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"846 - 847"},"PeriodicalIF":2.5,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47052800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}