{"title":"金融领域的多语种预训练语言模型","authors":"Duong Nguyen, Nam Cao, Son Nguyen, S. ta, C. Dinh","doi":"10.1109/KSE56063.2022.9953749","DOIUrl":null,"url":null,"abstract":"There has been an increasing demand for good semantic representations of text in the financial sector when solving natural language processing tasks in Fintech. Previous work has shown that widely used modern language models trained in the general domain often perform poorly in this particular domain. There have been attempts to overcome this limitation by introducing domain-specific language models learned from financial text. However, these approaches suffer from the lack of in-domain data, which is further exacerbated for languages other than English. These problems motivate us to develop a simple and efficient pipeline to extract large amounts of financial text from large-scale multilingual corpora such as OSCAR and C4. We conduct extensive experiments with various downstream tasks in three different languages to demonstrate the effectiveness of our approach across a wide range of standard benchmarks.","PeriodicalId":330865,"journal":{"name":"2022 14th International Conference on Knowledge and Systems Engineering (KSE)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MFinBERT: Multilingual Pretrained Language Model For Financial Domain\",\"authors\":\"Duong Nguyen, Nam Cao, Son Nguyen, S. ta, C. Dinh\",\"doi\":\"10.1109/KSE56063.2022.9953749\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"There has been an increasing demand for good semantic representations of text in the financial sector when solving natural language processing tasks in Fintech. Previous work has shown that widely used modern language models trained in the general domain often perform poorly in this particular domain. 
There have been attempts to overcome this limitation by introducing domain-specific language models learned from financial text. However, these approaches suffer from the lack of in-domain data, which is further exacerbated for languages other than English. These problems motivate us to develop a simple and efficient pipeline to extract large amounts of financial text from large-scale multilingual corpora such as OSCAR and C4. We conduct extensive experiments with various downstream tasks in three different languages to demonstrate the effectiveness of our approach across a wide range of standard benchmarks.\",\"PeriodicalId\":330865,\"journal\":{\"name\":\"2022 14th International Conference on Knowledge and Systems Engineering (KSE)\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 14th International Conference on Knowledge and Systems Engineering (KSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/KSE56063.2022.9953749\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 14th International Conference on Knowledge and Systems Engineering (KSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/KSE56063.2022.9953749","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MFinBERT: Multilingual Pretrained Language Model For Financial Domain
There is increasing demand for high-quality semantic representations of financial text when solving natural language processing tasks in Fintech. Previous work has shown that widely used modern language models trained on general-domain text often perform poorly in this domain. Attempts to overcome this limitation have introduced domain-specific language models learned from financial text. However, these approaches suffer from a lack of in-domain data, a problem that is further exacerbated for languages other than English. These limitations motivate us to develop a simple and efficient pipeline for extracting large amounts of financial text from large-scale multilingual corpora such as OSCAR and C4. We conduct extensive experiments on a variety of downstream tasks in three different languages to demonstrate the effectiveness of our approach across a wide range of standard benchmarks.
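The abstract does not detail how the pipeline identifies financial text within a general-purpose multilingual corpus. A common baseline for this kind of domain extraction is keyword-density filtering; the sketch below illustrates that idea only. The keyword list, threshold, and function names are illustrative assumptions, not the authors' actual method.

```python
# Hypothetical keyword-density filter for pulling financial documents out of a
# general web corpus (e.g. an OSCAR or C4 shard streamed as plain-text docs).
# The keyword set and threshold are illustrative assumptions.

FINANCIAL_KEYWORDS = {
    "stock", "bond", "dividend", "equity", "interest", "loan",
    "market", "bank", "portfolio", "asset", "inflation", "revenue",
}

def looks_financial(text: str, keywords=FINANCIAL_KEYWORDS,
                    threshold: float = 0.05) -> bool:
    """Return True if the fraction of tokens matching the keyword list
    meets the threshold. Tokenization is naive whitespace splitting."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t.strip(".,;:!?\"'()") in keywords)
    return hits / len(tokens) >= threshold

def filter_corpus(docs, **kwargs):
    """Lazily yield only the documents that pass the financial filter,
    so arbitrarily large corpora can be streamed without loading them."""
    for doc in docs:
        if looks_financial(doc, **kwargs):
            yield doc
```

In practice such a filter would be applied per language subset of the multilingual corpus (with a keyword list per language), and real pipelines often replace the keyword heuristic with a lightweight domain classifier; the streaming structure stays the same.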