International Conference on Applications of Natural Language to Data Bases: Latest Papers

Adversarial Capsule Networks for Romanian Satire Detection and Sentiment Analysis
Pub Date : 2023-06-13 DOI: 10.1007/978-3-031-35320-8_31
Sebastian-Vasile Echim, Răzvan-Alexandru Smădu, Andrei-Marius Avram, Dumitru-Clementin Cercel, Florin-Claudiu Pop
Citations: 0
RoBERTweet: A BERT Language Model for Romanian Tweets
Pub Date : 2023-06-11 DOI: 10.48550/arXiv.2306.06598
Iulian-Marius Tăiatu, Andrei-Marius Avram, Dumitru-Clementin Cercel, Florin-Claudiu Pop
Developing natural language processing (NLP) systems for social media analysis remains an important topic in artificial intelligence research. This article introduces RoBERTweet, the first Transformer architecture trained on Romanian tweets. Our RoBERTweet comes in two versions, following the base and large architectures of BERT. The corpus used for pre-training the models represents a novelty for the Romanian NLP community and consists of all tweets collected from 2008 to 2022. Experiments show that RoBERTweet models outperform the previous general-domain Romanian and multilingual language models on three NLP tasks with tweet inputs: emotion detection, sexist language identification, and named entity recognition. We make our models and the newly created corpus of Romanian tweets freely available.
Citations: 0
LonXplain: Lonesomeness as a Consequence of Mental Disturbance in Reddit Posts
Pub Date : 2023-05-30 DOI: 10.48550/arXiv.2305.18736
Muskan Garg, Chandni Saxena, Debabrata Samanta, B. Dorr
Social media is a potential source of information that infers latent mental states through Natural Language Processing (NLP). While narrating real-life experiences, social media users convey their feeling of loneliness or isolated lifestyle, impacting their mental well-being. Existing literature on psychological theories points to loneliness as the major consequence of interpersonal risk factors, propounding the need to investigate loneliness as a major aspect of mental disturbance. We formulate lonesomeness detection in social media posts as an explainable binary classification problem, discovering the users at-risk, suggesting the need of resilience for early control. To the best of our knowledge, there is no existing explainable dataset, i.e., one with human-readable, annotated text spans, to facilitate further research and development in loneliness detection causing mental disturbance. In this work, three experts: a senior clinical psychologist, a rehabilitation counselor, and a social NLP researcher define annotation schemes and perplexity guidelines to mark the presence or absence of lonesomeness, along with the marking of text-spans in original posts as explanation, in 3,521 Reddit posts. We expect the public release of our dataset, LonXplain, and traditional classifiers as baselines via GitHub.
Citations: 0
A Few-shot Approach to Resume Information Extraction via Prompts
Pub Date : 2022-09-20 DOI: 10.1007/978-3-031-35320-8_32
Chengguang Gan, Tatsunori Mori
Citations: 1
Zero and Few-shot Learning for Author Profiling
Pub Date : 2022-04-22 DOI: 10.48550/arXiv.2204.10543
Mara Chinea-Rios, Thomas Müller, Gretel Liz De la Peña Sarracén, Francisco Rangel, Marc Franco-Salvador
Author profiling classifies author characteristics by analyzing how language is shared among people. In this work, we study that task from a low-resource viewpoint: using little or no training data. We explore different zero and few-shot models based on entailment and evaluate our systems on several profiling tasks in Spanish and English. In addition, we study the effect of both the entailment hypothesis and the size of the few-shot training sample. We find that entailment-based models outperform supervised text classifiers based on roberta-XLM and that we can reach 80% of the accuracy of previous approaches using less than 50% of the training data on average.
Citations: 8
Metric Learning and Adaptive Boundary for Out-of-Domain Detection
Pub Date : 2022-04-22 DOI: 10.48550/arXiv.2204.10849
Petr Lorenc, Tommaso Gargiani, Jan Pichl, Jakub Konrád, Petro Marek, Ondrej Kobza, J. Sedivý
Conversational agents are usually designed for closed-world environments. Unfortunately, users can behave unexpectedly. Based on the open-world environment, we often encounter the situation that the training and test data are sampled from different distributions. Then, data from different distributions are called out-of-domain (OOD). A robust conversational agent needs to react to these OOD utterances adequately. Thus, the importance of robust OOD detection is emphasized. Unfortunately, collecting OOD data is a challenging task. We have designed an OOD detection algorithm independent of OOD data that outperforms a wide range of current state-of-the-art algorithms on publicly available datasets. Our algorithm is based on a simple but efficient approach of combining metric learning with adaptive decision boundary. Furthermore, compared to other algorithms, we have found that our proposed algorithm has significantly improved OOD performance in a scenario with a lower number of classes while preserving the accuracy for in-domain (IND) classes.
Citations: 0
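The abstract of "Metric Learning and Adaptive Boundary for Out-of-Domain Detection" above does not spell out its algorithm, so the following is only a rough sketch of the general idea of a per-class decision boundary in an embedding space. Everything here is an assumption for illustration: the function names, the 2-D toy vectors standing in for learned embeddings, and the `margin` heuristic for the radius. The paper's boundary is adaptive, whereas this sketch simply scales the largest training distance.

```python
import math


def centroid(points):
    """Mean vector of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_boundaries(train, margin=1.5):
    """Map each in-domain label to (centroid, radius).

    Hypothetical heuristic: radius = margin * largest training distance
    to the class centroid. This is NOT the paper's boundary rule.
    """
    boundaries = {}
    for label, points in train.items():
        c = centroid(points)
        radius = margin * max(euclidean(p, c) for p in points)
        boundaries[label] = (c, radius)
    return boundaries


def predict(x, boundaries):
    """Return the nearest class, or "OOD" if x lies outside its radius."""
    label, (c, radius) = min(
        boundaries.items(), key=lambda kv: euclidean(x, kv[1][0])
    )
    return label if euclidean(x, c) <= radius else "OOD"


# Toy 2-D "embeddings" for two in-domain intents (made-up data).
train = {
    "weather": [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]],
    "music": [[5.0, 5.0], [5.2, 5.0], [5.0, 5.2]],
}
boundaries = fit_boundaries(train)
in_domain = predict([0.1, 0.1], boundaries)
far_away = predict([2.5, 2.5], boundaries)
```

With this toy data, a point close to the "weather" cluster keeps that label, while a point midway between the clusters falls outside every radius and is flagged as OOD. No OOD examples are needed to fit the boundaries, which matches the abstract's stated motivation.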
Detecting early signs of depression in the conversational domain: The role of transfer learning in low-resource scenarios
Pub Date : 2022-04-22 DOI: 10.48550/arXiv.2204.10841
Petr Lorenc, Ana Sabina Uban, Paolo Rosso, Jan Šedivý
The high prevalence of depression in society has given rise to the need for new digital tools to assist in its early detection. To this end, existing research has mainly focused on detecting depression in the domain of social media, where there is a sufficient amount of data. However, with the rise of conversational agents like Siri or Alexa, the conversational domain is becoming more critical. Unfortunately, there is a lack of data in the conversational domain. We perform a study focusing on domain adaptation from social media to the conversational domain. Our approach mainly exploits the linguistic information preserved in the vector representation of text. We describe transfer learning techniques to classify users who suffer from early signs of depression with high recall. We achieve state-of-the-art results on a commonly used conversational dataset, and we highlight how the method can easily be used in conversational agents. We publicly release all source code.
Citations: 1
Unsupervised Ranking and Aggregation of Label Descriptions for Zero-Shot Classifiers
Pub Date : 2022-04-20 DOI: 10.48550/arXiv.2204.09481
Angelo Basile, Marc Franco-Salvador, Paolo Rosso
Zero-shot text classifiers based on label descriptions embed an input text and a set of labels into the same space: measures such as cosine similarity can then be used to select the most similar label description to the input text as the predicted label. In a true zero-shot setup, designing good label descriptions is challenging because no development set is available. Inspired by the literature on Learning with Disagreements, we look at how probabilistic models of repeated rating analysis can be used for selecting the best label descriptions in an unsupervised fashion. We evaluate our method on a set of diverse datasets and tasks (sentiment, topic and stance). Furthermore, we show that multiple, noisy label descriptions can be aggregated to boost the performance.
Citations: 1
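The zero-shot setup described in the abstract of "Unsupervised Ranking and Aggregation of Label Descriptions for Zero-Shot Classifiers" above (embed the input and each label description in the same space, then pick the description with the highest cosine similarity) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the bag-of-words `embed` function is a toy stand-in for a real sentence encoder, and the label descriptions and input text are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_predict(text, label_descriptions):
    """Score every label description against the text and return the best label."""
    text_vec = embed(text)
    scores = {
        label: cosine(text_vec, embed(description))
        for label, description in label_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical label descriptions for a sentiment task.
label_descriptions = {
    "positive": "a great and enjoyable experience",
    "negative": "a terrible and disappointing experience",
}
predicted, scores = zero_shot_predict(
    "the film was great and enjoyable", label_descriptions
)
```

The prediction depends entirely on how each label is worded, which is the difficulty the abstract points to: without a development set there is no direct way to check which description works best.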
Active Few-Shot Learning with FASL
Pub Date : 2022-04-20 DOI: 10.48550/arXiv.2204.09347
Thomas Müller, Guillermo Pérez-Torró, Angelo Basile, Marc Franco-Salvador
Recent advances in natural language processing (NLP) have led to strong text classification models for many tasks. However, still often thousands of examples are needed to train models with good quality. This makes it challenging to quickly develop and deploy new models for real world problems and business needs. Few-shot learning and active learning are two lines of research, aimed at tackling this problem. In this work, we combine both lines into FASL, a platform that allows training text classification models using an iterative and fast process. We investigate which active learning methods work best in our few-shot setup. Additionally, we develop a model to predict when to stop annotating. This is relevant as in a few-shot setup we do not have access to a large validation set.
Citations: 9
Named Entity Recognition for Partially Annotated Datasets
Pub Date : 2022-04-19 DOI: 10.48550/arXiv.2204.09081
Michael Strobl, Amine Trabelsi, Osmar R Zaiane
The most common Named Entity Recognizers are usually sequence taggers trained on fully annotated corpora, i.e. the class of all words for all entities is known. Partially annotated corpora, i.e. some but not all entities of some types are annotated, are too noisy for training sequence taggers since the same entity may be annotated one time with its true type but not another time, misleading the tagger. Therefore, we are comparing three training strategies for partially annotated datasets and an approach to derive new datasets for new classes of entities from Wikipedia without time-consuming manual data annotation. In order to properly verify that our data acquisition and training approaches are plausible, we manually annotated test datasets for two new classes, namely food and drugs.
Citations: 0