Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction

arXiv - CS - Artificial Intelligence Pub Date : 2024-08-05 DOI:arxiv-2408.02337

Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz

{"title":"Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction","authors":"Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz","doi":"arxiv-2408.02337","DOIUrl":null,"url":null,"abstract":"Advancements in AI and natural language processing have revolutionized\nmachine-human language interactions, with question answering (QA) systems\nplaying a pivotal role. The knowledge base question answering (KBQA) task,\nutilizing structured knowledge graphs (KG), allows for handling extensive\nknowledge-intensive questions. However, a significant gap exists in KBQA\ndatasets, especially for low-resource languages. Many existing construction\npipelines for these datasets are outdated and inefficient in human labor, and\nmodern assisting tools like Large Language Models (LLM) are not utilized to\nreduce the workload. To address this, we have designed and implemented a\nmodern, semi-automated approach for creating datasets, encompassing tasks such\nas KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR),\ntailored explicitly for low-resource environments. We executed this pipeline\nand introduced the PUGG dataset, the first Polish KBQA dataset, and novel\ndatasets for MRC and IR. Additionally, we provide a comprehensive\nimplementation, insightful findings, detailed statistics, and evaluation of\nbaseline models.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.02337","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, utilizing structured knowledge graphs (KG), allows for handling extensive knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and inefficient in human labor, and modern assisting tools like Large Language Models (LLM) are not utilized to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, and novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and evaluation of baseline models.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

为波兰语开发 PUGG：构建 KBQA、MRC 和 IR 数据集的现代方法

人工智能和自然语言处理技术的进步彻底改变了机器与人类之间的语言交互，其中问题解答（QA）系统发挥着举足轻重的作用。知识库问题解答（KBQA）任务利用结构化知识图谱（KG），可以处理大量知识密集型问题。然而，在知识库问题解答数据集方面存在很大差距，尤其是在低资源语言方面。这些数据集的许多现有构建管道已经过时，人力效率低下，而且没有利用大语言模型（LLM）等现代辅助工具来减少工作量。为了解决这个问题，我们设计并实施了一种现代的半自动化数据集创建方法，其中包括 KBQA、机器阅读理解（MRC）和信息检索（IR）等任务，专门为低资源环境量身定制。我们实施了这一流程，并推出了波兰首个 KBQA 数据集 PUGG 数据集，以及 MRC 和 IR 的新数据集。此外，我们还提供了全面的实施方案、深入的研究结果、详细的统计数据以及对基准模型的评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

arXiv - CS - Artificial Intelligence

自引率

0.00%

发文量