Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz
arXiv - CS - Artificial Intelligence, published 2024-08-05. DOI: arxiv-2408.02337 (https://doi.org/arxiv-2408.02337)
Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction
Advancements in AI and natural language processing have revolutionized
machine-human language interactions, with question answering (QA) systems
playing a pivotal role. The knowledge base question answering (KBQA) task,
which uses structured knowledge graphs (KGs), makes it possible to handle extensive
knowledge-intensive questions. However, a significant gap exists in KBQA
datasets, especially for low-resource languages. Many existing construction
pipelines for these datasets are outdated and labor-intensive, and they do not
use modern assistive tools such as Large Language Models (LLMs) to
reduce the workload. To address this, we have designed and implemented a
modern, semi-automated approach for creating datasets, encompassing tasks such
as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR),
tailored explicitly for low-resource environments. We executed this pipeline
and introduced the PUGG dataset, the first Polish KBQA dataset, and novel
datasets for MRC and IR. Additionally, we provide a comprehensive
implementation, insightful findings, detailed statistics, and evaluation of
baseline models.