BB-GeoGPT: A framework for learning a large language model for geographic information science

Information Processing & Management (IF 7.4; Zone 1, Management; JCR Q1, Computer Science, Information Systems) | Pub Date: 2024-06-22 | DOI: 10.1016/j.ipm.2024.103808
Yifan Zhang, Zhiyun Wang, Zhengting He, Jingxuan Li, Gengchen Mai, Jianfeng Lin, Cheng Wei, Wenhao Yu
Source: https://www.sciencedirect.com/science/article/pii/S0306457324001675

Abstract

Large language models (LLMs) exhibit impressive capabilities across diverse natural language processing tasks. Nevertheless, challenges remain: the models have very large parameter counts, and some, such as ChatGPT and GPT-4, are accessible only through APIs, which prevents deployment on mobile devices as well as domain adaptation or fine-tuning. Moreover, while LLMs excel in general domains, their performance in specialized fields such as GIS may not always meet the expectations of domain experts. This is primarily attributed to the diverse disciplinary origins of the training data, which often lack comprehensive coverage and treatment of knowledge specific to individual disciplines (e.g., GIS). Therefore, there is a crucial need to train and adapt LLMs for specific professional fields. In this paper, we focus on the GIS domain and introduce BB(BaBy)-GeoGPT, a large language model with GIS-specific knowledge. To this end, we curated a comprehensive set of resources comprising model pretraining data (BB-GeoPT, 26,907 documents), supervised fine-tuning data (BB-GeoSFT, 35,876 instructions), and evaluation data (BB-GeoEval, 600 objective questions and 150 subjective questions). BB-GeoGPT is developed by first adapting an open-source general-domain LLM, LLaMA-2-7B, to our pretraining data, and then further fine-tuning the model on BB-GeoSFT through instruction tuning. In extensive experiments on the evaluation dataset, BB-GeoGPT improves accuracy by 10.55% to 47.57% on objective questions and by 7.87% to 27.73% on subjective questions compared with general LLMs of similar size. Moreover, our data collection strategy and the amassed data can serve as a foundation for advancing LLM research in the GIS domain.
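
The abstract reports accuracy on the objective questions of BB-GeoEval but does not describe how answers are scored. The following is a minimal sketch, assuming the objective questions are multiple-choice items with a single gold option label (an assumption; the abstract does not state the question format). The `ObjectiveQuestion` fields, the prompt template, and the toy items are hypothetical and are not taken from BB-GeoEval.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ObjectiveQuestion:
    """One multiple-choice item; field names are hypothetical."""
    question: str
    options: dict[str, str]  # option label -> option text
    answer: str              # gold option label, e.g. "B"


def build_prompt(item: ObjectiveQuestion) -> str:
    """Render an item as a plain-text prompt (illustrative template only)."""
    lines = [f"Question: {item.question}"]
    for label in sorted(item.options):
        lines.append(f"{label}. {item.options[label]}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predictions: list[str], items: list[ObjectiveQuestion]) -> float:
    """Fraction of items whose predicted label matches the gold label."""
    if len(predictions) != len(items):
        raise ValueError("predictions and items must have the same length")
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper() == item.answer.strip().upper()
        for pred, item in zip(predictions, items)
    )
    return correct / len(items)


# Toy items written for this sketch; they are not from the paper's dataset.
items = [
    ObjectiveQuestion(
        "Which of these is a vector data format?",
        {"A": "GeoTIFF", "B": "Shapefile"},
        "B",
    ),
    ObjectiveQuestion(
        "What does CRS stand for?",
        {"A": "Coordinate reference system", "B": "Central raster store"},
        "A",
    ),
]

# Hypothetical model outputs, one predicted label per item.
baseline_acc = accuracy(["A", "A"], items)  # 0.5
adapted_acc = accuracy(["B", "A"], items)   # 1.0
print(build_prompt(items[0]))
print(f"baseline={baseline_acc:.2%} adapted={adapted_acc:.2%}")
```

Whether the reported improvement ranges are absolute percentage-point differences or relative gains is not stated in the abstract, so the sketch stops at per-model accuracy and leaves that comparison to the reader.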
