Multi-Source (Pre-)Training for Cross-Domain Measurement, Unit and Context Extraction

Workshop on Biomedical Natural Language Processing | Pub Date: 2023-08-05 | DOI: 10.18653/v1/2023.bionlp-1.1

Yueling Li, Sebastian Martschat, Simone Paolo Ponzetto



Abstract




We present a cross-domain approach for automated measurement and context extraction based on pre-trained language models. We construct a multi-source, multi-domain corpus and train an end-to-end extraction pipeline. We then apply multi-source task-adaptive pre-training and fine-tuning to benchmark the cross-domain generalization capability of our model. Further, we conceptualize and apply a task-specific error analysis and derive insights for future work. Our results suggest that multi-source training leads to the best overall results, while single-source training yields the best results for the respective individual domain. While our setup is successful at extracting quantity values and units, more research is needed to improve the extraction of contextual entities. We make the cross-domain corpus used in this work available online.

