Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz
{"title":"Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction","authors":"Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz","doi":"arxiv-2408.02337","DOIUrl":null,"url":null,"abstract":"Advancements in AI and natural language processing have revolutionized\nmachine-human language interactions, with question answering (QA) systems\nplaying a pivotal role. The knowledge base question answering (KBQA) task,\nutilizing structured knowledge graphs (KG), allows for handling extensive\nknowledge-intensive questions. However, a significant gap exists in KBQA\ndatasets, especially for low-resource languages. Many existing construction\npipelines for these datasets are outdated and inefficient in human labor, and\nmodern assisting tools like Large Language Models (LLM) are not utilized to\nreduce the workload. To address this, we have designed and implemented a\nmodern, semi-automated approach for creating datasets, encompassing tasks such\nas KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR),\ntailored explicitly for low-resource environments. We executed this pipeline\nand introduced the PUGG dataset, the first Polish KBQA dataset, and novel\ndatasets for MRC and IR. Additionally, we provide a comprehensive\nimplementation, insightful findings, detailed statistics, and evaluation of\nbaseline models.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.02337","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Advancements in AI and natural language processing have revolutionized
machine-human language interactions, with question answering (QA) systems
playing a pivotal role. The knowledge base question answering (KBQA) task,
utilizing structured knowledge graphs (KG), allows for handling extensive
knowledge-intensive questions. However, a significant gap exists in KBQA
datasets, especially for low-resource languages. Many existing construction
pipelines for these datasets are outdated and inefficient in human labor, and
modern assisting tools like Large Language Models (LLM) are not utilized to
reduce the workload. To address this, we have designed and implemented a
modern, semi-automated approach for creating datasets, encompassing tasks such
as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR),
tailored explicitly for low-resource environments. We executed this pipeline
and introduced the PUGG dataset, the first Polish KBQA dataset, and novel
datasets for MRC and IR. Additionally, we provide a comprehensive
implementation, insightful findings, detailed statistics, and evaluation of
baseline models.