Knowledge assimilation: Implementing knowledge-guided agricultural large language model

IF 7.6 | Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Knowledge-Based Systems | Pub Date: 2025-04-08 | Epub Date: 2025-02-24 | DOI: 10.1016/j.knosys.2025.113197
Jingchi Jiang, Lian Yan, Haifeng Liu, Zhenbo Xia, Haotian Wang, Yang Yang, Yi Guan
Knowledge-Based Systems, volume 314, Article 113197. Journal Article. https://www.sciencedirect.com/science/article/pii/S0950705125002448
Citations: 0

Abstract

Although supervised fine-tuning (SFT) and retrieval-augmented generation (RAG) can help large language models (LLMs) incorporate domain knowledge, they have the following limitations: (1) Data scarcity. High-quality dialogue data and knowledge bases for agriculture are severely lacking. (2) Token-level oversight. Current SFT primarily fits general tokens and neglects agriculture-specific tokens, which leads to omissions of critical information in responses. (3) Sentence-level hurdle. Agricultural queries require sentence-level evidence from domain knowledge bases, which challenges the precision of evidence retrievers. This paper introduces a novel Knowledge-guided Agriculture LLM (KALLM) designed to facilitate multi-task decision-making in agricultural settings. We first address the data quality issue by establishing an annotation standard and constructing a comprehensive dataset of 220,000 Q&A pairs derived from authoritative agricultural documents. At the token level, we propose a knowledge-coordinated SFT approach that enhances the representation of agriculture-specific tokens by amplifying their significance during the decoding process. At the sentence level, we introduce a self-reflective RAG mechanism based on topic matching to improve the accuracy of evidence retrieval. Experiments against seven competitive open-domain LLMs and the current SFT-RAG pipeline show that KALLM achieves state-of-the-art performance and significantly outperforms existing generation frameworks in response fluency, accuracy, and domain fidelity.
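The abstract does not spell out how agriculture-specific tokens are amplified, so the following is only a minimal sketch of one plausible reading: a per-token loss in which domain tokens receive a larger weight than general tokens. It is not the authors' formulation. The function name `weighted_token_nll`, the `domain_vocab` set, and the `domain_weight` value of 2.0 are all assumptions introduced for illustration.

```python
import math


def weighted_token_nll(token_log_probs, tokens, domain_vocab, domain_weight=2.0):
    """Weighted average negative log-likelihood of a target sequence.

    token_log_probs: log-probability the model assigned to each target token.
    tokens:          target tokens, aligned with token_log_probs.
    domain_vocab:    tokens treated as agriculture-specific (hypothetical set).
    domain_weight:   multiplier applied to domain tokens (hypothetical value).
    """
    if len(token_log_probs) != len(tokens):
        raise ValueError("token_log_probs and tokens must have the same length")
    if not tokens:
        raise ValueError("at least one target token is required")

    # Domain tokens count more than general tokens in the loss.
    weights = [domain_weight if t in domain_vocab else 1.0 for t in tokens]
    weighted_sum = sum(-lp * w for lp, w in zip(token_log_probs, weights))

    # Normalise by the total weight so the value stays on a per-token scale.
    return weighted_sum / sum(weights)


# Illustrative target sequence with made-up model probabilities.
tokens = ["apply", "nitrogen", "fertilizer", "early"]
log_probs = [math.log(p) for p in (0.9, 0.2, 0.3, 0.8)]
domain_vocab = {"nitrogen", "fertilizer"}  # assumed domain terms

uniform_loss = weighted_token_nll(log_probs, tokens, domain_vocab, domain_weight=1.0)
weighted_loss = weighted_token_nll(log_probs, tokens, domain_vocab, domain_weight=2.0)
```

In this toy case the model predicts the two domain terms poorly, so the weighted loss comes out higher than the uniform one, and a training step would push harder on exactly those tokens. With `domain_weight=1.0` the function reduces to the ordinary mean negative log-likelihood.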
Source journal
Knowledge-Based Systems (Engineering and Technology, Computer Science: Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Articles published: 1245
Review time: 7.8 months
About the journal: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.