
Recent Advances in Natural Language Processing: Latest Publications

Are the Multilingual Models Better? Improving Czech Sentiment with Transformers
Pub Date : 2021-08-24 DOI: 10.26615/978-954-452-072-4_128
P. Pribán, J. Steinberger
In this paper, we aim to improve Czech sentiment analysis with transformer-based models and their multilingual versions. More concretely, we study the task of polarity detection for the Czech language on three sentiment polarity datasets. We fine-tune and perform experiments with five multilingual and three monolingual models. We compare the monolingual and multilingual models' performance, including a comparison with an older approach based on recurrent neural networks. Furthermore, we test the multilingual models' ability to transfer knowledge from English to Czech (and vice versa) with zero-shot cross-lingual classification. Our experiments show that the large multilingual models can outperform the monolingual models. They are also able to detect polarity in another language without any training data, with performance at most 4.4% worse than state-of-the-art monolingual trained models. Moreover, we achieve new state-of-the-art results on all three datasets.
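As a concrete illustration of the fine-tuning setup described above, here is a minimal sketch of fine-tuning a multilingual transformer for Czech polarity detection with the Hugging Face transformers library. The model name, the three-class label scheme, and the example sentence are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of one fine-tuning step, assuming the `transformers`
# and `torch` packages; model choice and label scheme are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # a typical multilingual model (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

texts = ["Ten film byl skvělý!"]        # Czech: "That movie was great!"
labels = torch.tensor([2])              # e.g. 0=negative, 1=neutral, 2=positive

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels) # returns cross-entropy loss + logits
outputs.loss.backward()                 # one training step (optimizer omitted)
```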
Citations: 7
Improving Distantly Supervised Relation Extraction with Self-Ensemble Noise Filtering
Pub Date : 2021-08-22 DOI: 10.26615/978-954-452-072-4_116
Tapas Nayak, Navonil Majumder, Soujanya Poria
Distantly supervised models are very popular for relation extraction, since distant supervision yields a large amount of training data without human annotation. In distant supervision, a sentence is considered a source of a tuple if it contains both entities of the tuple. However, this condition is too permissive and does not guarantee the presence of relevant relation-specific information in the sentence. As such, distantly supervised training data contains a great deal of noise, which adversely affects model performance. In this paper, we propose a self-ensemble filtering mechanism to filter out noisy samples during the training process. We evaluate our proposed framework on the New York Times dataset, which was obtained via distant supervision. Our experiments with multiple state-of-the-art neural relation extraction models show that the proposed filtering mechanism improves the robustness of the models and increases their F1 scores.
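A minimal sketch of the self-ensemble idea as we read it: predictions from successive training snapshots vote on each distantly supervised label, and samples whose labels the ensemble no longer supports are filtered out. The voting rule, warm-up period, and threshold below are assumptions, not the paper's exact mechanism.

```python
# Hedged sketch: snapshot predictions accumulated per sample act as a
# self-ensemble; distant labels they contradict are treated as noise.
from collections import defaultdict

history = defaultdict(list)  # sample id -> model prediction per training epoch

def record_epoch(preds):
    """preds: {sample_id: predicted_relation} from the current snapshot."""
    for sid, rel in preds.items():
        history[sid].append(rel)

def filter_noisy(dataset, warmup=3, keep_threshold=0.5):
    """Keep samples whose distant label the self-ensemble still supports."""
    kept = []
    for sample in dataset:
        preds = history[sample["id"]]
        if len(preds) < warmup:                 # too early to judge
            kept.append(sample)
            continue
        support = sum(p == sample["label"] for p in preds) / len(preds)
        if support >= keep_threshold:
            kept.append(sample)
    return kept

record_epoch({1: "born_in", 2: "works_for"})
record_epoch({1: "born_in", 2: "no_relation"})
record_epoch({1: "born_in", 2: "no_relation"})
data = [{"id": 1, "label": "born_in"}, {"id": 2, "label": "works_for"}]
print([s["id"] for s in filter_noisy(data)])    # -> [1]; sample 2 is dropped
```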
Citations: 7
Not All Linearizations Are Equally Data-Hungry in Sequence Labeling Parsing
Pub Date : 2021-08-17 DOI: 10.26615/978-954-452-072-4_111
Alberto Muñoz-Ortiz, Michalina Strzyz, David Vilares
Different linearizations have been proposed to cast dependency parsing as sequence labeling, solving the task as (i) a head selection problem, (ii) finding a representation of the token arcs as bracket strings, or (iii) associating partial transition sequences of a transition-based parser with words. Yet little is understood about how these linearizations behave in low-resource setups. Here, we first study their data efficiency, simulating data-restricted setups from a diverse set of rich-resource treebanks. Second, we test whether such differences manifest in truly low-resource setups. The results show that head selection encodings are more data-efficient and perform better in an ideal (gold) framework, but that this advantage largely vanishes in favour of bracketing formats when the running setup resembles a real-world low-resource configuration.
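For concreteness, a minimal sketch of the head-selection linearization (option (i) above): each token's label encodes the relative offset to its head, so parsing reduces to per-token sequence labeling. Dependency relation labels and the PoS-based offset variants used in the literature are omitted here.

```python
# Minimal head-selection encoding: per-token labels are relative offsets
# to the head token (with "root" for the sentence root). Illustrative only.
def encode_head_selection(heads):
    """heads[i] is the 1-based head index of token i+1; 0 means root."""
    labels = []
    for i, h in enumerate(heads, start=1):
        labels.append("root" if h == 0 else f"{h - i:+d}")
    return labels

# "John saw Mary": "saw" is the root; "John" and "Mary" attach to token 2.
print(encode_head_selection([2, 0, 2]))   # ['+1', 'root', '-1']
```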
Citations: 5
AutoChart: A Dataset for Chart-to-Text Generation Task
Pub Date : 2021-08-16 DOI: 10.26615/978-954-452-072-4_183
Jiawen Zhu, Jinye Ran, R. Lee, Kenny Choo, Zhi Li
The analytical description of charts is an exciting and important research area with many applications in academia and industry. Yet this challenging task has received limited attention from the computational linguistics research community. This paper proposes AutoChart, a large dataset for the analytical description of charts, which aims to encourage more research into this important area. Specifically, we offer a novel framework that generates charts and their analytical descriptions automatically. We conducted extensive human and machine evaluation of the generated charts and descriptions and demonstrate that the generated texts are informative, coherent, and relevant to the corresponding charts.
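Below is a hedged sketch of a chart-plus-description generation loop in the spirit of the framework described; the matplotlib rendering and the template wording are illustrative assumptions, not the AutoChart implementation.

```python
# Illustrative sketch: render a chart and emit a templated analytical
# description of it. Assumes matplotlib; template phrasing is our own.
import matplotlib.pyplot as plt

def make_chart_and_description(labels, values, title, path):
    plt.bar(labels, values)
    plt.title(title)
    plt.savefig(path)
    plt.close()
    hi = max(range(len(values)), key=values.__getitem__)
    lo = min(range(len(values)), key=values.__getitem__)
    return (f"The chart '{title}' compares {len(values)} categories. "
            f"{labels[hi]} has the highest value ({values[hi]}), while "
            f"{labels[lo]} has the lowest ({values[lo]}).")

desc = make_chart_and_description(
    ["2019", "2020", "2021"], [12, 18, 9], "Papers per year", "chart.png")
print(desc)
```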
Citations: 6
Syntax Matters! Syntax-Controlled in Text Style Transfer
Pub Date : 2021-08-12 DOI: 10.26615/978-954-452-072-4_064
Zhiqiang Hu, R. Lee, C. Aggarwal
Existing text style transfer (TST) methods rely on style classifiers to disentangle a text's content and style attributes. While the style classifier plays a critical role in existing TST methods, its effect on those methods has not been investigated. In this paper, we conduct an empirical study of the limitations of the style classifiers used in existing TST methods. We demonstrate that existing style classifiers cannot learn sentence syntax effectively and ultimately worsen existing TST models' performance. To address this issue, we propose a novel Syntax-Aware Controllable Generation (SACG) model, which includes a syntax-aware style classifier that ensures the learned style latent representations effectively capture sentence structure for TST. Through extensive experiments on two popular text style transfer tasks, we show that our proposed method significantly outperforms twelve state-of-the-art methods. Our case studies also demonstrate SACG's ability to generate fluent target-style sentences that preserve the original content.
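A minimal PyTorch sketch of what "syntax-aware" can mean for a style classifier: word and part-of-speech embeddings are concatenated so the learned representation reflects sentence structure. The GRU encoder and all dimensions here are our assumptions, not the SACG architecture.

```python
# Hedged sketch of a syntax-aware style classifier: PoS-tag embeddings
# are fed alongside word embeddings so syntax influences the style logits.
import torch
import torch.nn as nn

class SyntaxAwareStyleClassifier(nn.Module):
    def __init__(self, vocab_size, n_pos_tags, n_styles, dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(n_pos_tags, dim)
        self.encoder = nn.GRU(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_styles)

    def forward(self, word_ids, pos_ids):
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
        _, h = self.encoder(x)          # final hidden state summarizes the sentence
        return self.out(h[-1])          # style logits

clf = SyntaxAwareStyleClassifier(vocab_size=10000, n_pos_tags=18, n_styles=2)
logits = clf(torch.randint(0, 10000, (1, 7)), torch.randint(0, 18, (1, 7)))
```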
Citations: 3
Decoupled Transformer for Scalable Inference in Open-domain Question Answering
Pub Date : 2021-08-05 DOI: 10.26615/978-954-452-072-4_044
Haytham ElFadeel, Stanislav Peshterliev
Large transformer models, such as BERT, achieve state-of-the-art results in machine reading comprehension (MRC) for open-domain question answering (QA). However, transformers have a high computational cost for inference, which makes them hard to apply in online QA systems for applications like voice assistants. To reduce computational cost and latency, we propose decoupling the transformer MRC model into an input-component and a cross-component. The decoupling allows part of the representation computation to be performed offline and cached for online use. To retain the decoupled transformer's accuracy, we devise a knowledge distillation objective from a standard transformer model. Moreover, we introduce learned representation compression layers, which reduce the storage requirement of the cache by a factor of four. In experiments on the SQuAD 2.0 dataset, the decoupled transformer reduces the computational cost and latency of open-domain MRC by 30-40%, with an F1-score only 1.2 points lower than a standard transformer.
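A conceptual sketch of the decoupled serving path: the lower "input-component" encodes passages offline and caches the result, so only the question and the upper "cross-component" run per request. The layer split, dimensions, and cache keying below are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: passage representations are precomputed and cached;
# online inference re-encodes only the short question. Assumes PyTorch.
import torch
import torch.nn as nn

class DecoupledMRC(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.input_component = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.cross_component = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.cache = {}                          # passage id -> representation

    def encode_passage_offline(self, pid, passage_emb):
        with torch.no_grad():
            self.cache[pid] = self.input_component(passage_emb)

    def answer(self, question_emb, pid):
        q = self.input_component(question_emb)   # online: question only
        joint = torch.cat([q, self.cache[pid]], dim=1)
        return self.cross_component(joint)       # joint layers over both

model = DecoupledMRC()
model.encode_passage_offline("doc1", torch.randn(1, 100, 64))
out = model.answer(torch.randn(1, 12, 64), "doc1")  # only 12 tokens re-encoded
```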
Citations: 1
PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors
Pub Date : 2021-08-02 DOI: 10.26615/978-954-452-072-4_012
Andrei-Marius Avram, V. Pais, D. Tufis
EuroVoc is a multilingual thesaurus that was built for organizing the legislative documents of the European Union institutions. It contains thousands of categories at different levels of specificity, and its descriptors are used to annotate legal texts in almost thirty languages. In this work we propose a unified framework for EuroVoc classification in 22 languages by fine-tuning modern transformer-based pretrained language models. We study the performance of our trained models extensively and show that they significantly improve the results obtained by a similar tool, JEX, on the same dataset. The code and the fine-tuned models were open-sourced, together with a programmatic interface that eases the process of loading the weights of a trained model and classifying new documents.
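The released tool has its own programmatic interface; as a hedged sketch of the underlying approach, the snippet below frames EuroVoc assignment as multi-label classification with a fine-tuned multilingual transformer using the Hugging Face API. The model name, descriptor count, and decision threshold are illustrative assumptions, not the PyEuroVoc interface.

```python
# Hedged sketch of multi-label EuroVoc classification; see the released
# PyEuroVoc code for the actual interface. Assumes `transformers`/`torch`.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

num_descriptors = 7000  # order of magnitude of EuroVoc descriptors (assumption)
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=num_descriptors,
    problem_type="multi_label_classification")

batch = tokenizer(["Regulation on the protection of personal data ..."],
                  truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**batch).logits)  # independent per-label scores
predicted = (probs > 0.5).nonzero()               # descriptor ids above threshold
```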
Citations: 7
A Psychologically Informed Part-of-Speech Analysis of Depression in Social Media
Pub Date : 2021-07-31 DOI: 10.26615/978-954-452-072-4_024
Ana-Maria Bucur, Ioana R. Podină, Liviu P. Dinu
In this work, we provide an extensive part-of-speech analysis of the discourse of social media users with depression. Research in psychology has revealed that depressed users tend to be self-focused, more preoccupied with themselves, and to ruminate more about their lives and emotions. Our work aims to make use of large-scale datasets and computational methods for a quantitative exploration of discourse. We use the publicly available depression dataset from the 2018 Early Risk Prediction on the Internet Workshop (eRisk) and extract part-of-speech features and several indices based on them. Our results reveal statistically significant differences between depressed and non-depressed individuals, confirming findings from the existing psychology literature. Our work provides insights into the way depressed individuals express themselves on social media platforms, allowing for better-informed computational models to help monitor and prevent mental illnesses.
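A minimal sketch of the kind of part-of-speech indices involved, assuming spaCy and its small English model are installed; the specific ratios below (first-person pronouns as a proxy for self-focus) follow the findings summarized above but are our own illustrative choices.

```python
# Illustrative PoS feature extraction with spaCy; feature set is an assumption.
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_features(text):
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_punct]
    counts = {}
    for t in tokens:
        counts[t.pos_] = counts.get(t.pos_, 0) + 1
    first_person = sum(t.lower_ in {"i", "me", "my", "mine", "myself"}
                       for t in tokens)
    return {
        "pronoun_ratio": counts.get("PRON", 0) / len(tokens),
        "first_person_ratio": first_person / len(tokens),  # self-focus proxy
        "verb_ratio": counts.get("VERB", 0) / len(tokens),
    }

print(pos_features("I keep thinking about my mistakes and I cannot stop."))
```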
Citations: 9
Multilingual Coreference Resolution with Harmonized Annotations
Pub Date : 2021-07-26 DOI: 10.26615/978-954-452-072-4_125
O. Pražák, Miloslav Konopík, Jakub Sido
In this paper, we present coreference resolution experiments with the newly created multilingual corpus CorefUD (Nedoluzhko et al., 2021). We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan. In addition to monolingual experiments, we combine the training data in multilingual experiments and train two joint models: one for the Slavic languages and one for all the languages together. We rely on an end-to-end deep learning model that we slightly adapted for the CorefUD corpus. Our results show that we can profit from harmonized annotations, and that using joint models helps significantly for the languages with smaller training data.
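Because CorefUD harmonizes annotations across languages, the joint-training setup can be as simple as pooling the per-language files into one training set. The sketch below assumes CoNLL-U files named by language code; the file layout and document splitting are our assumptions, not the paper's code.

```python
# Hedged sketch of pooling harmonized per-language files for joint training.
from pathlib import Path

SLAVIC = {"cs", "ru", "pl"}

def load_documents(path):
    """Split a CoNLL-U file into documents on '# newdoc' comment lines."""
    docs, current = [], []
    for line in open(path, encoding="utf-8"):
        if line.startswith("# newdoc") and current:
            docs.append(current)
            current = []
        current.append(line)
    if current:
        docs.append(current)
    return docs

def build_training_set(corpus_dir, languages):
    docs = []
    for path in sorted(Path(corpus_dir).glob("*.conllu")):
        lang = path.stem.split("_")[0]      # e.g. "cs_pdt.conllu" -> "cs"
        if lang in languages:
            docs.extend(load_documents(path))
    return docs

slavic_data = build_training_set("corefud/", SLAVIC)
all_data = build_training_set("corefud/", SLAVIC | {"de", "es", "ca"})
```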
Citations: 11
BERT Embeddings for Automatic Readability Assessment
Pub Date : 2021-06-15 DOI: 10.26615/978-954-452-072-4_069
Joseph Marvin Imperial
Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. One of the many open problems in the field is making models trained for the task effective even for low-resource languages. In this study, we propose an alternative way of utilizing the information-rich embeddings of BERT models together with handcrafted linguistic features through a combined method for readability assessment. Results show that the proposed method outperforms classical approaches to readability assessment on English and Filipino datasets, obtaining up to a 12.4% increase in F1 performance. We also show that the general information encoded in BERT embeddings can serve as a substitute feature set for low-resource languages like Filipino, which have limited semantic and syntactic NLP tools for explicitly extracting feature values for the task.
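A hedged sketch of the combined method as described: a BERT sentence embedding is concatenated with handcrafted linguistic features and fed to a classical classifier. The [CLS] pooling, the two toy features, and the random-forest head are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: BERT [CLS] embedding + handcrafted features ->
# classical classifier. Assumes `transformers`, `torch`, and scikit-learn.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    batch = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return bert(**batch).last_hidden_state[:, 0].squeeze(0).numpy()  # [CLS]

def handcrafted(text):
    words = text.split()
    return np.array([len(words),                                # sentence length
                     sum(len(w) for w in words) / len(words)])  # avg word length

texts = ["The cat sat on the mat.", "Epistemology concerns the nature of knowledge."]
labels = [0, 1]                                                 # readability levels
X = np.stack([np.concatenate([embed(t), handcrafted(t)]) for t in texts])
clf = RandomForestClassifier().fit(X, labels)
```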
Citations: 19