MaTableGPT: GPT-Based Table Data Extractor from Materials Science Literature

Advanced Science | IF 14.1 | Zone 1, Materials Science | JCR Q1, Chemistry, Multidisciplinary | Pub Date: 2025-01-24 | DOI: 10.1002/advs.202408221
Gyeong Hoon Yi, Jiwoo Choi, Hyeongyun Song, Olivia Miano, Jaewoong Choi, Kihoon Bang, Byungju Lee, Seok Su Sohn, David Buttler, Anna Hiszpanski, Sang Soo Han, Donghun Kim
Citations: 0

Abstract

Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, tables in materials science papers take highly diverse forms, which makes rule-based extraction ineffective. To overcome this challenge, the study presents MaTableGPT, a GPT-based extractor of table data from the materials science literature. MaTableGPT's key strategies are table data representation and table splitting, which improve GPT comprehension, and follow-up questions, which filter out hallucinated information. When applied to a large body of water splitting catalysis literature, MaTableGPT achieves an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of GPT usage cost, labeling cost, and extraction accuracy for the zero-shot, few-shot, and fine-tuning learning methods, the study presents a Pareto-front mapping in which few-shot learning emerges as the most balanced solution, owing to both its high extraction accuracy (total F1 score >95%) and its low cost (GPT usage cost of 5.97 US dollars and labeling cost of 10 I/O paired examples). Statistical analyses of the database generated by MaTableGPT reveal insights into the distribution of overpotentials and elemental utilization across the catalysts reported in the water splitting literature.
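
The two quantitative ideas in the abstract, an F1 score as the measure of extraction accuracy and a Pareto front over GPT usage cost, labeling cost, and accuracy, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The names `Method`, `f1_score`, `dominates`, and `pareto_front` are hypothetical, F1 is computed with the standard harmonic-mean formula (the paper's "total F1 score" may be aggregated differently), and the candidate values are placeholders chosen for illustration, not results from the study.

```python
from typing import NamedTuple


class Method(NamedTuple):
    name: str
    usage_cost: float   # lower is better (e.g. API cost in US dollars)
    labeling_cost: int  # lower is better (e.g. number of labeled I/O examples)
    accuracy: float     # higher is better (e.g. F1 score)


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    denom = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denom if denom else 0.0


def dominates(a: Method, b: Method) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on one."""
    no_worse = (
        a.usage_cost <= b.usage_cost
        and a.labeling_cost <= b.labeling_cost
        and a.accuracy >= b.accuracy
    )
    strictly_better = (
        a.usage_cost < b.usage_cost
        or a.labeling_cost < b.labeling_cost
        or a.accuracy > b.accuracy
    )
    return no_worse and strictly_better


def pareto_front(methods: list[Method]) -> list[Method]:
    """Keep every method that no other method dominates."""
    return [m for m in methods if not any(dominates(o, m) for o in methods)]


# Placeholder values for illustration only; NOT results from the paper.
candidates = [
    Method("A", usage_cost=1.0, labeling_cost=0, accuracy=0.80),
    Method("B", usage_cost=2.0, labeling_cost=10, accuracy=0.95),
    Method("C", usage_cost=3.0, labeling_cost=10, accuracy=0.90),  # dominated by B
    Method("D", usage_cost=9.0, labeling_cost=100, accuracy=0.97),
]
front = pareto_front(candidates)
```

In this sketch a method stays on the front when no alternative is at least as cheap on both cost axes and at least as accurate. Choosing a single "most balanced" point on that front, as the study does for few-shot learning, is a separate judgment that the code does not attempt.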

Source journal
Advanced Science
Categories: Chemistry, Multidisciplinary; Nanoscience & Nanotechnology
CiteScore: 18.90
Self-citation rate: 2.60%
Articles published: 1602
Review time: 1.9 months
About the journal: Advanced Science is a prestigious open access journal that focuses on interdisciplinary research in materials science, physics, chemistry, medical and life sciences, and engineering. The journal aims to promote cutting-edge research by employing a rigorous and impartial review process. It is committed to presenting research articles with the highest quality production standards, ensuring maximum accessibility of top scientific findings. With its vibrant and innovative publication platform, Advanced Science seeks to revolutionize the dissemination and organization of scientific knowledge.