Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model

Daehee Kim, Deokhyung Kang, Sangwon Ryu, Gary Geunbae Lee

arXiv:2409.07088 (arXiv - CS - Computation and Language, 2024-09-11)
Abstract
Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T datasets restricts progress in general-domain G2T research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated with a novel method that leverages a Large Language Model (LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without relying on external ontologies. Experimental results demonstrate that PLMs fine-tuned on WikiOFGraph outperform those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.
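To make the described pipeline concrete, below is a minimal sketch of one plausible reading of the synthesis loop: an LLM extracts a graph (set of triples) from a Wikipedia sentence, and a Data-QuestEval-style consistency score filters out poorly aligned pairs. The helpers `llm_extract_triples` and `data_questeval_score`, as well as the 0.5 threshold, are hypothetical placeholders for illustration, not the paper's actual prompts, models, or settings.

```python
# Sketch of an LLM + consistency-filter synthesis loop (assumed reading of
# the abstract): extract triples from a sentence, score graph-text
# consistency, and keep only well-aligned pairs.
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def llm_extract_triples(sentence: str) -> List[Triple]:
    """Hypothetical stand-in: prompt an LLM of your choice to convert a
    sentence into knowledge-graph triples."""
    raise NotImplementedError("wrap your LLM call here")


def data_questeval_score(triples: List[Triple], text: str) -> float:
    """Hypothetical stand-in: return a graph-text consistency score in
    [0, 1], e.g. by wrapping the Data-QuestEval metric."""
    raise NotImplementedError("wrap the consistency metric here")


def synthesize_pairs(sentences: List[str],
                     threshold: float = 0.5) -> List[Dict]:
    """Build graph-text pairs, keeping only those whose consistency
    score clears `threshold` (an assumed cutoff, not from the paper)."""
    dataset = []
    for sentence in sentences:
        triples = llm_extract_triples(sentence)
        if not triples:
            continue  # nothing structured was extracted; skip the sentence
        if data_questeval_score(triples, sentence) >= threshold:
            dataset.append({"graph": triples, "text": sentence})
    return dataset
```

The key design point this sketch highlights is that alignment quality is enforced by an automatic filter rather than an external ontology: any pair the scorer judges inconsistent is simply discarded, which is what allows the approach to scale to millions of general-domain pairs.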