Natural Language Engineering: latest articles, page 3

How you describe procurement calls matters: Predicting outcome of public procurement using call descriptions

IF 2.5; Zone 3, Computer Science; Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-08-10 DOI: 10.1017/s135132492300030x

U. Acikalin, Mustafa Kaan Gorgun, Mucahid Kutlu, B. Tas

A competitive and cost-effective public procurement (PP) process is essential for the effective use of public resources. In this work, we explore whether descriptions of procurement calls can be used to predict their outcomes. In particular, we focus on predicting four well-known economic metrics: (i) the number of offers, (ii) whether only a single offer is received, (iii) whether a foreign firm is awarded the contract, and (iv) whether the contract price exceeds the expected price. We extract the European Union’s multilingual PP notices, covering 22 different languages. We investigate fine-tuning multilingual transformer models and propose two approaches: (1) multilayer perceptron (MLP) models with transformer embeddings for each business sector in which the training data are filtered based on the procurement category and (2) a k-nearest neighbor (KNN)-based approach fine-tuned using triplet networks. The fine-tuned MBERT model outperforms all other models in predicting calls with a single offer and foreign contract awards, whereas our MLP-based filtering approach yields state-of-the-art results in predicting contracts in which the contract price exceeds the expected price. Furthermore, our KNN-based approach outperforms all the baselines in all tasks and our other proposed models in predicting the number of offers. Moreover, we investigate cross-lingual and multilingual training for our tasks and observe that multilingual training improves prediction accuracy in all our tasks. Overall, our experiments suggest that notice descriptions play an important role in the outcomes of PP calls.
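
The abstract mentions a KNN-based approach fine-tuned with triplet networks but gives no implementation details. As a rough illustration only, the sketch below shows a standard triplet margin loss and a majority-vote KNN prediction over embeddings. The function names, the Euclidean distance, the margin value, and the toy labels are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (hypothetical margin value).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def knn_predict(query, examples, k=3):
    """Majority vote among the k nearest (embedding, label) examples."""
    nearest = sorted(examples, key=lambda e: euclidean(query, e[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy embeddings standing in for transformer outputs (assumed data).
examples = [([0.0, 1.0], "single"), ([1.0, 0.0], "single"), ([5.0, 5.0], "multi")]
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
print(knn_predict([0.0, 0.0], examples, k=3))
```

In a setup like this, training with the triplet loss would pull calls with the same outcome closer together in embedding space, so that nearest-neighbor voting becomes more reliable.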

Citations: 0

SSL-GAN-RoBERTa: A robust semi-supervised model for detecting Anti-Asian COVID-19 hate speech on social media

IF 2.5; Zone 3, Computer Science; Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-08-03 DOI: 10.1017/s1351324923000396

Xuanyu Su, Yansong Li, Paula Branco, D. Inkpen

Anti-Asian speech during the COVID-19 pandemic has been a serious problem with severe consequences. A hate speech wave swept social media platforms. The timely detection of Anti-Asian COVID-19-related hate speech is of utmost importance, not only to allow the application of preventive mechanisms but also to anticipate and possibly prevent other similar discriminatory situations. In this paper, we address the problem of detecting Anti-Asian COVID-19-related hate speech from social media data. Previous approaches that tackled this problem used a transformer-based model, BERT/RoBERTa, trained on the homologous annotated dataset and achieved good performance on this task. However, this requires extensive and annotated datasets with a strong connection to the topic. Both goals are difficult to meet without employing reliable, vast, and costly resources. In this paper, we propose a robust semi-supervised model, SSL-GAN-RoBERTa, that learns from a limited heterogeneous dataset and whose performance is further enhanced by using vast amounts of unlabeled data from another related domain. Compared with the RoBERTa baseline model, the experimental results show that the model has substantial performance gains in terms of Accuracy and Macro-F1 score in different scenarios that use data from different domains. Our proposed model achieves state-of-the-art performance results while efficiently using unlabeled data, showing promising applicability to other complex classification tasks where large amounts of labeled examples are difficult to obtain.
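
The abstract reports gains in Accuracy and Macro-F1. For readers unfamiliar with the second metric, here is a minimal sketch of how both are commonly computed. The label names and toy predictions are made up for illustration and are not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Each class counts equally, so a rare class (such as hate speech)
    influences the score as much as a frequent one.
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels and predictions.
gold = ["hate", "hate", "none", "none"]
pred = ["hate", "none", "none", "none"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Macro-F1 is usually reported next to accuracy on imbalanced data because accuracy alone can look high even when the minority class is mostly missed.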

Citations: 0

Masked transformer through knowledge distillation for unsupervised text style transfer

IF 2.5; Zone 3, Computer Science; Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-07-25 DOI: 10.1017/s1351324923000323

Arthur Scalercio, A. Paes

Text style transfer (TST) aims at automatically changing a text’s stylistic features, such as formality, sentiment, authorial style, humor, and complexity, while still trying to preserve its content. Although the scientific community has investigated TST since the 1980s, it has recently regained attention by adopting deep unsupervised strategies to address the challenge of training without parallel data. In this manuscript, we investigate how relying on sequence-to-sequence pretraining models affects the performance of TST when the pretraining step leverages pairs of paraphrase data. Furthermore, we propose a new technique to enhance the sequence-to-sequence model by distilling knowledge from masked language models. We evaluate our proposals on three unsupervised style transfer tasks with widely used benchmarks: author imitation, formality transfer, and polarity swap. The evaluation relies on quantitative and qualitative analyses and comparisons with the results of state-of-the-art models. For the author imitation and the formality transfer tasks, we show that using the proposed techniques improves all measured metrics and leads to state-of-the-art (SOTA) results in content preservation and overall score in the author imitation domain. In the formality transfer domain, we match the SOTA method on the style control metric. Regarding the polarity swap domain, we show that the knowledge distillation component improves all measured metrics. The paraphrase pretraining increases content preservation at the expense of style control. Based on the results reached in these domains, we also discuss in the manuscript whether the tasks we address have the same nature and should be equally treated as TST tasks.

Citations: 0

Assessment of the E3C corpus for the recognition of disorders in clinical texts

IF 2.5; Zone 3, Computer Science; Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-07-18 DOI: 10.1017/s1351324923000335

Roberto Zanoli, A. Lavelli, Daniel Verdi do Amarante, Daniele Toti

Disorder named entity recognition (DNER) is a fundamental task of biomedical natural language processing, which has attracted plenty of attention. This task consists in extracting named entities of disorders such as diseases, symptoms, and pathological functions from unstructured text. The European Clinical Case Corpus (E3C) is a freely available multilingual corpus (English, French, Italian, Spanish, and Basque) of semantically annotated clinical case texts. The entities of type disorder in the clinical cases are annotated at both the mention and the concept level. At the mention level, the annotation identifies the entity text spans, for example, abdominal pain. At the concept level, the entity text spans are associated with their concept identifiers in the Unified Medical Language System, for example, C0000737. This corpus can be exploited as a benchmark for training and assessing information extraction systems. Within the context of the present work, multiple experiments have been conducted in order to test the appropriateness of the mention-level annotation of the E3C corpus for training DNER models. In these experiments, traditional machine learning models like conditional random fields and more recent multilingual pre-trained models based on deep learning were compared with standard baselines. With regard to the multilingual pre-trained models, they were fine-tuned (i) on each language of the corpus to test per-language performance, (ii) on all languages to test multilingual learning, and (iii) on all languages except the target language to test cross-lingual transfer learning. Results show the appropriateness of the E3C corpus for training a system capable of mining disorder entities from clinical case texts. Researchers can use these results as the baselines for this corpus to compare their own models. The implemented models have been made available through the European Language Grid platform for quick and easy access.
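
To make the two annotation levels concrete, the sketch below represents a mention as a character span with a concept identifier attached and converts it to token-level BIO tags, a common input format for sequence taggers such as conditional random fields. The `Mention` class, the whitespace tokenizer, the `DISORDER` tag name, and the example sentence are assumptions for illustration. The abstract does not state how E3C stores its annotations or which tagging scheme the authors used.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    start: int  # character offset, inclusive (mention level)
    end: int    # character offset, exclusive (mention level)
    cui: str    # concept identifier (concept level)


def tokenize(text):
    """Whitespace tokenizer returning (token, start, end) triples."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end
    return tokens


def to_bio(text, mentions):
    """Tag each token as B-DISORDER, I-DISORDER, or O."""
    tagged = []
    for tok, start, end in tokenize(text):
        tag = "O"
        for m in mentions:
            if start >= m.start and end <= m.end:
                tag = "B-DISORDER" if start == m.start else "I-DISORDER"
                break
        tagged.append((tok, tag))
    return tagged


# Hypothetical sentence; the span and identifier follow the abstract's example.
text = "Patient reports abdominal pain today"
begin = text.index("abdominal pain")
mention = Mention(begin, begin + len("abdominal pain"), "C0000737")
for token, tag in to_bio(text, [mention]):
    print(token, tag)
```

Only the span is used for tagging here, which mirrors the abstract's focus on mention-level annotation; the concept identifier is carried along but not predicted.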

Citations: 0

Describe the house and I will tell you the price: House price prediction with textual description data

IF 2.5; Zone 3, Computer Science; Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-07-18 DOI: 10.1017/s1351324923000360

Han Zhang, Yansong Li, Paula Branco

House price prediction is an important problem that could benefit home buyers and sellers. Traditional models for house price prediction use numerical attributes such as the number of rooms but disregard the house description text. The recent developments in text processing suggest these can be valuable attributes, which motivated us to use house descriptions. This paper focuses on the house asking/advertising price and studies the impact of using house description texts to predict the final house price. To achieve this, we collected a large and diverse set of attributes on house postings, including the house advertising price. Then, we compare the performance of three scenarios: using only the house description, only numeric attributes, or both. We processed the description text through three word embedding techniques: TF-IDF, Word2Vec, and BERT. Four regression algorithms are trained using only textual data, non-textual data, or both. Our results show that by using exclusively the description data with Word2Vec and a Deep Learning model, we can achieve good performance. However, the best overall performance is obtained when using both textual and non-textual features. An $R^2$ of 0.7904 is achieved by the deep learning model using only description data on the testing data. This clearly indicates that using the house description text alone is a strong predictor for the house price. However, when observing the RMSE on the test data, the best model was gradient boosting using both numeric and description data. Overall, we observe that combining the textual and non-textual features improves the learned model and provides performance benefits when compared against using only one of the feature types. We also provide a freely available application for house price prediction, which is solely based on a house text description and uses our final developed model with Word2Vec and Deep Learning to predict the house price.
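
The abstract ranks models by $R^2$ and by RMSE, and the two metrics can favor different models. The sketch below shows the usual definitions of both. The prices are invented for illustration and have no connection to the paper's data.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical prices (in thousands) and model predictions.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
print(r_squared(actual, predicted), rmse(actual, predicted))
```

$R^2$ is unitless and relative to the variance of the target, while RMSE is expressed in price units and penalizes large individual errors, which is one reason a model can lead on one metric and not the other.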

Citations: 0

Navigating the text generation revolution: Traditional data-to-text NLG companies and the rise of ChatGPT

IF 2.5; Zone 3, Computer Science; Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-07-01 DOI: 10.1017/S1351324923000347

R. Dale

Since the release of ChatGPT at the end of November 2022, generative AI has been talked about endlessly in both the technical press and the mainstream media. Large language model technology has been heralded as many things: the disruption of the search engine, the end of the student essay, the bringer of disinformation … but what does it mean for commercial providers of earlier iterations of natural language generation technology? We look at how the major players in the space are responding, and where things might go in the future.

Citations: 0

Korean named entity recognition based on language-specific features

Zone 3, Computer Science; Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-06-29 DOI: 10.1017/s1351324923000311

Yige Chen, KyungTae Lim, Jungyeul Park

In this paper, we propose a novel way of improving named entity recognition (NER) in the Korean language using its language-specific features. While the field of NER has been studied extensively in recent years, the mechanism of efficiently recognizing named entities (NEs) in Korean has hardly been explored. This is because the Korean language has distinct linguistic properties that present challenges for modeling. Therefore, an annotation scheme for Korean corpora by adopting the CoNLL-U format, which decomposes Korean words into morphemes and reduces the ambiguity of NEs in the original segmentation that may contain functional morphemes such as postpositions and particles, is proposed herein. We investigate how the NE tags are best represented in this morpheme-based scheme and implement an algorithm to convert word-based and syllable-based Korean corpora with NEs into the proposed morpheme-based format. Analyses of the results of traditional and neural models reveal that the proposed morpheme-based format is feasible, and the varied performances of the models under the influence of various additional language-specific features are demonstrated. Extrinsic conditions were also considered to observe the variance of the performances of the proposed models, given different types of data, including the original segmentation and different types of tagging formats.

Citations: 0

Linguistically aware evaluation of coreference resolution from the perspective of higher-level applications

IF 2.5; Zone 3, Computer Science; Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-06-19 DOI: 10.1017/s1351324923000293

Voldemaras Žitkus, R. Butkienė, R. Butleris

Coreference resolution is an important part of natural language processing used in machine translation, semantic search, and various other information retrieval and understanding systems. One of the challenges in this field is the evaluation of resolution approaches. Many different metrics have been proposed, but most of them rely on certain assumptions, like equivalence between different mentions of the same discourse-world entity, and do not account for the overrepresentation of certain types of coreferences in the evaluation data. In this paper, a new coreference evaluation strategy that focuses on linguistic and semantic information is presented that can address some of these shortcomings. The evaluation model was developed in the broader context of developing coreference resolution capabilities for the Lithuanian language; therefore, the experiment was also carried out using Lithuanian language resources, but the proposed evaluation strategy is not language-dependent.

Citations: 0

A resampling-based method to evaluate NLI models

IF 2.5 Zone 3 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-06-09 DOI: 10.1017/s1351324923000268

Felipe Salvatore, M. Finger, R. Hirata, A. G. Patriota

The recent progress of deep learning techniques has produced models capable of achieving high scores on traditional Natural Language Inference (NLI) datasets. To understand the generalization limits of these powerful models, an increasing number of adversarial evaluation schemes have appeared. These works use a similar evaluation method: they construct a new NLI test set based on sentences with known logical and semantic properties (the adversarial set), train a model on a benchmark NLI dataset, and evaluate it on the new set. Poor performance on the adversarial set is identified as a model limitation. The problem with this evaluation procedure is that it may only indicate a sampling problem. A machine learning model can perform poorly on a new test set because the text patterns presented in the adversarial set are not well represented in the training sample. To address this problem, we present a new evaluation method, the Invariance under Equivalence test (IE test). The IE test trains a model with sufficient adversarial examples and checks the model’s performance on two equivalent datasets. As a case study, we apply the IE test to state-of-the-art NLI models using synonym substitution as the form of adversarial examples. The experiment shows that, despite their high predictive power, these models usually produce different inference outputs for equivalent inputs, and, more importantly, this deficiency cannot be solved by adding adversarial observations to the training data.
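The idea of checking a model on two equivalent datasets with a resampling procedure can be sketched as follows. The abstract does not give the test statistic or the resampling scheme, so this sketch swaps in a generic paired sign-flip permutation test on per-example correctness; it should not be read as the authors' IE test. The function name `paired_resampling_test` and its parameters are hypothetical.

```python
import random


def paired_resampling_test(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired correctness.

    correct_a[i] and correct_b[i] say whether the model was right on
    example i of the original set and on its equivalent (for instance
    synonym-substituted) counterpart. Under the null hypothesis that
    the model is invariant, the sign of each paired difference is
    exchangeable. This is an assumed stand-in procedure, not the
    paper's exact method.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have the same length")
    if not correct_a:
        raise ValueError("inputs must not be empty")

    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    observed_sum = sum(diffs)

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each paired difference.
        resampled_sum = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled_sum) >= abs(observed_sum):
            extreme += 1

    accuracy_gap = observed_sum / len(diffs)
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return accuracy_gap, p_value


# Toy usage with made-up correctness vectors.
original = [True] * 30
equivalent = [False] * 30
gap, p = paired_resampling_test(original, equivalent)
print(f"accuracy gap={gap:.2f}, p-value={p:.4f}")
```

A small p-value here would suggest that the accuracy gap between the two equivalent sets is unlikely to be a sampling artifact, which is the kind of distinction the abstract draws between a sampling problem and a genuine model limitation.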

Citations: 0

Explainable Natural Language Processing, by Anders Søgaard. San Rafael, CA: Morgan & Claypool, 2021. ISBN 978-1-636-39213-4. XV+107 pages.

IF 2.5 Zone 3 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering

Pub Date : 2023-06-02 DOI: 10.1017/s1351324923000281

Zihao Zhang

Citations: 0