Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction

Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz
Source: arXiv - CS - Artificial Intelligence
Published: 2024-08-05
Identifier: arxiv-2408.02337 (https://doi.org/arxiv-2408.02337)
Citations: 0

Abstract

Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, which uses structured knowledge graphs (KG), makes it possible to handle a wide range of knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and inefficient in their use of human labor, and modern assisting tools such as Large Language Models (LLM) are not used to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, and novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and evaluation of baseline models.