
Recent Advances in Natural Language Processing: Latest Publications

Contextual-Lexicon Approach for Abusive Language Detection
Pub Date : 2021-04-25 DOI: 10.26615/978-954-452-072-4_161
F. Vargas, F. Góes, Isabelle Carvalho, Fabrício Benevenuto, T. Pardo
Since a lexicon-based approach is scientifically more elegant, as it explains the solution components and generalizes more easily to other applications, this paper provides a new approach for offensive language and hate speech detection on social media, embodied in a lexicon of implicit and explicit offensive and swearing expressions annotated with contextual information. Due to the severity of abusive comments on social media in Brazil, and the lack of research in Portuguese, Brazilian Portuguese is the language used to validate the models. Nevertheless, our method may be applied to any other language. The conducted experiments show the effectiveness of the proposed approach, outperforming the current baseline methods for the Portuguese language.
Citations: 8
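As a rough illustration of how a contextual lexicon differs from a plain word list, the sketch below flags a token sequence either because it contains an explicitly offensive term, or because an implicitly offensive term appears near one of its recorded trigger-context words. The lexicon entries, field names, and window size are hypothetical, not taken from the paper's actual resource.

```python
# Minimal sketch of a contextual-lexicon classifier (illustrative only).
# Each entry carries contextual metadata: whether the term is explicitly
# offensive, or offensive only near certain context words.

LEXICON = {
    "idiot": {"explicit": True},
    "pig": {"explicit": False, "offensive_contexts": {"you", "cop"}},
}

def is_abusive(tokens):
    """Flag a token list if it contains an explicit term, or an implicit
    term appearing within 3 tokens of one of its offensive contexts."""
    tokens = [t.lower() for t in tokens]
    for i, tok in enumerate(tokens):
        entry = LEXICON.get(tok)
        if entry is None:
            continue
        if entry["explicit"]:
            return True
        window = set(tokens[max(0, i - 3):i + 4])
        if window & entry.get("offensive_contexts", set()):
            return True
    return False
```

The same word ("pig") is thus flagged or passed depending on its neighbours, which is the intuition behind annotating lexicon entries with context.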
Recognizing and Splitting Conditional Sentences for Automation of Business Processes Management
Pub Date : 2021-04-01 DOI: 10.26615/978-954-452-072-4_167
Ngoc Phuoc An Vo, Irene Manotas, Octavian Popescu, A. Černiauskas, V. Sheinin
Business Process Management (BPM) is the discipline responsible for discovering, analyzing, redesigning, monitoring, and controlling business processes. One of the most crucial tasks of BPM is discovering and modelling business processes from text documents. In this paper, we present our system that resolves an end-to-end problem consisting of 1) recognizing conditional sentences from technical documents, 2) finding boundaries to extract conditional and resultant clauses from each conditional sentence, and 3) categorizing each resultant clause as Action or Consequence, which later helps to generate new steps in our business process model automatically. We created a new dataset and three models to solve this problem. Our best model achieved very promising results of 83.82, 87.84, and 85.75 for Precision, Recall, and F1, respectively, for extracting Condition, Action, and Consequence clauses using the Exact Match metric.
Citations: 3
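Step 2 of the pipeline (finding clause boundaries) can be illustrated with a toy rule-based splitter. The paper trains models for this, so the regex patterns below are only a hedged approximation covering the two simplest English orderings of condition and result.

```python
import re

# Toy clause splitter: "If <condition>, (then) <result>" and
# "<result> if <condition>". Learned models would handle far more.

def split_conditional(sentence):
    """Return (condition, result) or None if no pattern matches."""
    m = re.match(r"^[Ii]f\s+(.+?),\s*(?:then\s+)?(.+)$", sentence)
    if m:
        return m.group(1), m.group(2).rstrip(".")
    m = re.match(r"^(.+?)\s+if\s+(.+)$", sentence)
    if m:
        return m.group(2).rstrip("."), m.group(1)
    return None
```

The extracted result clause would then be classified as Action or Consequence in step 3.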
Czert – Czech BERT-like Model for Language Representation
Pub Date : 2021-03-24 DOI: 10.26615/978-954-452-072-4_149
Jakub Sido, O. Pražák, P. Pribán, Jan Pasek, Michal Seják, Miloslav Konopík
This paper describes the training process of the first Czech monolingual language representation models, based on the BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more than the multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish new state-of-the-art results on nine datasets. Finally, we discuss properties of monolingual and multilingual models based upon our results. We publish all the pre-trained and fine-tuned models freely for the research community.
Citations: 27
Transforming Multi-Conditioned Generation from Meaning Representation
Pub Date : 2021-01-12 DOI: 10.26615/978-954-452-072-4_092
Joosung Lee
Our study focuses on language generation, treating the various pieces of information that represent the meaning of an utterance as multiple conditions of generation. Generating an utterance from a meaning representation (MR) usually involves two steps: sentence planning and surface realization. However, we propose a simple one-stage framework that generates utterances directly from the MR. Our model is based on GPT-2 and generates utterances with flat conditions on slot and value pairs, without needing to determine the structure of the sentence. We evaluate several systems on the E2E dataset with 6 automatic metrics. Our system is a simple method, yet it demonstrates performance comparable to previous systems on automated metrics. In addition, using only 10% of the dataset without any other techniques, our model achieves comparable performance and shows the possibility of performing zero-shot generation and expanding to other datasets.
Citations: 3
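The "flat conditions on slot and value pairs" idea can be sketched as a linearization step: the MR is serialized into a plain prefix string, and generation is conditioned on that prefix rather than on any sentence structure. The separator tokens and slot names below are illustrative assumptions, not the paper's exact format.

```python
# Sketch: linearize an MR's slot-value pairs into a flat condition
# string. Slots are sorted so the same MR always yields the same prefix.

def linearize_mr(mr):
    """Turn an MR dict of slot-value pairs into a flat condition string."""
    pairs = [f"{slot} = {value}" for slot, value in sorted(mr.items())]
    return " | ".join(pairs) + " || "

# E2E-style example (slot names assumed for illustration):
mr = {"name": "The Punter", "food": "Italian", "priceRange": "cheap"}
prefix = linearize_mr(mr)
# A GPT-2-style model would be fine-tuned on `prefix + reference_utterance`
# and sampled from `prefix` alone at test time.
```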
Knowledge Discovery in COVID-19 Research Literature
Pub Date : 2020-08-12 DOI: 10.18653/v1/2020.nlpcovid19-2.22
Alejandro Piad-Morffis, Suilan Estévez-Velarde, Ernesto L. Estevanell-Valladares, Yoan Gutiérrez Vázquez, A. Montoyo, R. Muñoz, Yudivián Almeida-Cruz
This paper presents the preliminary results of an ongoing project that analyzes the growing body of scientific research published around the COVID-19 pandemic. In this research, a general-purpose semantic model is used to double annotate a batch of 500 sentences that were manually selected from the CORD-19 corpus. Afterwards, a baseline text-mining pipeline is designed and evaluated via a large batch of 100,959 sentences. We present a qualitative analysis of the most interesting facts automatically extracted and highlight possible future lines of development. The preliminary results show that general-purpose semantic models are a useful tool for discovering fine-grained knowledge in large corpora of scientific documents.
Citations: 3
Evaluating the Consistency of Word Embeddings from Small Data
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_016
Jelke Bloem, Antske Fokkens, Aurélie Herbelot
In this work, we address the evaluation of distributional semantic models trained on smaller, domain-specific texts, specifically philosophical text. In particular, we inspect the behaviour of models that use a pre-trained background space in learning. We propose a measure of consistency which can be used as an evaluation metric when no in-domain gold-standard data is available. This measure simply computes the ability of a model to learn similar embeddings from different parts of some homogeneous data. We show that, in spite of being a simple evaluation, consistency actually depends on various combinations of factors, including the nature of the data itself, the model used to train the semantic space, and the frequency of the learnt terms, both in the background space and in the in-domain data of interest.
Citations: 9
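The consistency measure, as described, can be approximated with a toy experiment: learn embeddings separately from two parts of the same homogeneous data and average the cosine similarity between the two vectors obtained for each word. The real experiments use trained models with a background space; the co-occurrence "embeddings" below are only a self-contained stand-in.

```python
import numpy as np

# Toy embeddings: raw co-occurrence counts within a small window.
def cooc_vectors(sentences, vocab, window=2):
    idx = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in idx:
                continue
            for c in sent[max(0, i - window):i + window + 1]:
                if c in idx and c != w:
                    vecs[idx[w], idx[c]] += 1
    return vecs

def consistency(half_a, half_b, vocab):
    """Mean cosine similarity between the two embeddings of each word,
    one learnt from each half of the data."""
    va, vb = cooc_vectors(half_a, vocab), cooc_vectors(half_b, vocab)
    sims = []
    for i in range(len(vocab)):
        na, nb = np.linalg.norm(va[i]), np.linalg.norm(vb[i])
        if na > 0 and nb > 0:
            sims.append(float(va[i] @ vb[i] / (na * nb)))
    return sum(sims) / len(sims) if sims else 0.0
```

A model that learns the same vectors from either half of homogeneous data scores near 1; unstable training or insufficient data pushes the score down.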
Question Similarity in Community Question Answering: A Systematic Exploration of Preprocessing Methods and Models
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_070
Florian Kunneman, Thiago Castro Ferreira, Antal van den Bosch, E. Krahmer
Community Question Answering forums are popular among Internet users, and a basic problem they encounter is trying to find out if their question has already been posed before. To address this issue, NLP researchers have developed methods to automatically detect question-similarity, which was one of the shared tasks in SemEval. The best performing systems for this task made use of Syntactic Tree Kernels or the SoftCosine metric. However, it remains unclear why these methods seem to work, whether their performance can be improved by better preprocessing methods and what kinds of errors they (and other methods) make. In this paper, we therefore systematically combine and compare these two approaches with the more traditional BM25 and translation-based models. Moreover, we analyze the impact of preprocessing steps (lowercasing, suppression of punctuation and stop words removal) and word meaning similarity based on different distributions (word translation probability, Word2Vec, fastText and ELMo) on the performance of the task. We conduct an error analysis to gain insight into the differences in performance between the system set-ups. The implementation is made publicly available from https://github.com/fkunneman/DiscoSumo/tree/master/ranlp.
社区问答论坛在互联网用户中很受欢迎,他们遇到的一个基本问题是试图找出他们的问题是否已经被提出过。为了解决这个问题,NLP研究人员开发了自动检测问题相似性的方法,这是SemEval中的共享任务之一。执行此任务的最佳系统使用语法树核或软余弦度量。然而,目前还不清楚为什么这些方法似乎有效,是否可以通过更好的预处理方法来提高它们的性能,以及它们(和其他方法)会产生什么样的错误。因此,在本文中,我们系统地将这两种方法与更传统的BM25和基于翻译的模型相结合并进行比较。此外,我们还分析了预处理步骤(小写字母、标点符号抑制和停止词去除)和基于不同分布(单词翻译概率、Word2Vec、fastText和ELMo)的词义相似度对任务性能的影响。我们进行错误分析,以深入了解系统设置之间的性能差异。该实现可从https://github.com/fkunneman/DiscoSumo/tree/master/ranlp公开获得。
Citations: 7
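The SoftCosine metric mentioned above generalizes cosine similarity with a term-similarity matrix S, so two questions sharing no exact words can still score above zero. A minimal sketch with a hand-made S (the values are illustrative; in practice S comes from word-embedding similarities):

```python
import numpy as np

def soft_cosine(a, b, S):
    """a, b: bag-of-words count vectors; S: term-similarity matrix."""
    num = a @ S @ b
    den = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return float(num / den) if den else 0.0

# Vocabulary: [buy, purchase, laptop]; "buy" ~ "purchase" with sim 0.8.
S = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q1 = np.array([1.0, 0.0, 1.0])   # "buy laptop"
q2 = np.array([0.0, 1.0, 1.0])   # "purchase laptop"
```

With S as the identity matrix the formula reduces to plain cosine (0.5 here), while the off-diagonal entry lifts the soft-cosine score to 0.9.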
Semi-Supervised Induction of POS-Tag Lexicons with Tree Models
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_060
Maciej Janicki
We approach the problem of POS tagging of morphologically rich languages in a setting where only a small amount of labeled training data is available. We show that a bigram HMM tagger benefits from re-training on a larger untagged text using Baum-Welch estimation. Most importantly, this estimation can be significantly improved by pre-guessing tags for OOV words based on morphological criteria. We consider two models for this task: a character-based recurrent neural network, which guesses the tag from the string form of the word, and a recently proposed graph-based model of morphological transformations. In the latter, the unknown POS tags can be modeled as latent variables in a way very similar to Hidden Markov Tree models and an analogue of the Forward-Backward algorithm can be formulated, which enables us to compute expected values over unknown taggings. We evaluate both the quality of the induced tag lexicon and its impact on the HMM’s tagging accuracy. In both tasks, the graph-based morphology model performs significantly better than the RNN predictor. This confirms the intuition that morphologically related words provide useful information about an unknown word’s POS tag.
Citations: 0
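The bigram HMM tagger at the core of this setup decodes with the Viterbi algorithm. A minimal version with toy probabilities is sketched below; the Baum-Welch re-training and the morphological OOV models are out of scope here, and unknown words simply get a tiny floor emission probability.

```python
# Minimal bigram-HMM Viterbi decoder with toy probability tables.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for the word list `obs`."""
    # V[t][s] = (best probability of reaching state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s]
                       * emit_p[s].get(obs[t], 1e-6), prev)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

STATES = ["DET", "NOUN"]
START = {"DET": 0.8, "NOUN": 0.2}
TRANS = {"DET": {"DET": 0.1, "NOUN": 0.9},
         "NOUN": {"DET": 0.5, "NOUN": 0.5}}
EMIT = {"DET": {"the": 0.9},
        "NOUN": {"dog": 0.5, "barks": 0.3}}
```

Baum-Welch re-estimates START/TRANS/EMIT from untagged text; the paper's contribution is replacing the flat 1e-6 OOV floor with morphology-informed tag guesses.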
Enhancing Unsupervised Sentence Similarity Methods with Deep Contextualised Word Representations
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_115
Tharindu Ranasinghe, Constantin Orasan, R. Mitkov
Calculating Semantic Textual Similarity (STS) plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. All modern state-of-the-art STS methods rely on word embeddings in one way or another. The recently introduced contextualised word embeddings have proved more effective than standard word embeddings in many natural language processing tasks. This paper evaluates the impact of several contextualised word embeddings on unsupervised STS methods and compares them with existing supervised/unsupervised STS methods on different datasets in different languages and different domains.
Citations: 13
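A minimal unsupervised STS baseline of the kind evaluated here averages per-word vectors into a sentence vector and takes the cosine between the two sentence vectors. The toy static embeddings below stand in for contextualised ones (ELMo- or BERT-style), which would instead produce sentence-dependent word vectors fed into the same averaging step.

```python
import numpy as np

EMB = {  # toy static word embeddings, for illustration only
    "a":   np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "dog": np.array([0.2, 1.0]),
}

def sentence_similarity(s1, s2, emb=EMB):
    """Cosine similarity between mean word vectors of two token lists."""
    v1 = np.mean([emb[w] for w in s1 if w in emb], axis=0)
    v2 = np.mean([emb[w] for w in s2 if w in emb], axis=0)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```

Swapping the static lookup for a contextualised encoder changes only how `v1` and `v2` are produced, which is exactly the axis the paper varies.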
A Neural Network Component for Knowledge-Based Semantic Representations of Text
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_105
Alejandro Piad-Morffis, R. Muñoz, Yudivián Almeida-Cruz, Yoan Gutiérrez Vázquez, Suilan Estévez-Velarde, A. Montoyo
This paper presents Semantic Neural Networks (SNNs), a knowledge-aware component based on deep learning. SNNs can be trained to encode explicit semantic knowledge from an arbitrary knowledge base, and can subsequently be combined with other deep learning architectures. At prediction time, SNNs provide a semantic encoding extracted from the input data, which can be exploited by other neural network components to build extended representation models that can face alternative problems. The SNN architecture is defined in terms of the concepts and relations present in a knowledge base. Based on this architecture, a training procedure is developed. Finally, an experimental setup is presented to illustrate the behaviour and performance of a SNN for a specific NLP problem, in this case, opinion mining for the classification of movie reviews.
Citations: 1