Distilling implicit multimodal knowledge into large language models for zero-resource dialogue generation

IF 15.5 | Zone 1 (Computer Science) | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Information Fusion | Pub Date: 2025-06-01 | Epub Date: 2025-02-04 | DOI: 10.1016/j.inffus.2025.102985

Bo Zhang, Hui Ma, Jian Ding, Jian Wang, Bo Xu, Hongfei Lin

Information Fusion, Volume 118, Article 102985. Full text: https://www.sciencedirect.com/science/article/pii/S1566253525000582

Citations: 0

Abstract

Integrating multimodal knowledge into large language models (LLMs) represents a significant advancement in dialogue generation capabilities. However, the effective incorporation of such knowledge in zero-resource scenarios remains a substantial challenge due to the scarcity of diverse, high-quality dialogue datasets. To address this, we propose the Visual Implicit Knowledge Distillation Framework (VIKDF), an innovative approach aimed at enhancing LLMs for enriched dialogue generation in zero-resource contexts by leveraging implicit multimodal knowledge. VIKDF comprises two main stages: knowledge distillation, using an Implicit Query Transformer to extract and encode visual implicit knowledge from image–text pairs into knowledge vectors; and knowledge integration, employing a novel Bidirectional Variational Information Fusion technique to seamlessly integrate these distilled vectors into LLMs. This enables the LLMs to generate dialogues that are not only coherent and engaging but also exhibit a deep understanding of the context through implicit multimodal cues, effectively overcoming the limitations of zero-resource scenarios. Our extensive experimentation across two dialogue datasets shows that VIKDF outperforms existing state-of-the-art models in generating high-quality dialogues. The code is available at https://github.com/zhangbo-nlp/VIKDF.
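The abstract describes the distillation stage only at a high level, so the following is a rough illustration of the general idea of using a small set of learnable query vectors to compress a variable-length input into a fixed number of "knowledge vectors" via cross-attention. It is not the VIKDF implementation (that is in the linked repository). Every name here (`cross_attend`, `learned_queries`, `NUM_QUERIES`, the dimensions) is hypothetical, the random features stand in for real encoder outputs, projection matrices and training are omitted, and the final prefix step is a generic stand-in rather than the paper's Bidirectional Variational Information Fusion.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, features):
    """Scaled dot-product cross-attention without learned projections.

    Each query attends over all feature vectors and returns a
    softmax-weighted average of them, so the number of outputs equals
    the number of queries regardless of how many features come in.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, f) / scale for f in features])
        mixed = [sum(w * f[i] for w, f in zip(weights, features))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs


def make_vectors(count, dim, rng):
    """Random Gaussian vectors used as placeholders for real embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8          # hypothetical embedding width
NUM_QUERIES = 4  # hypothetical number of knowledge vectors

# Stand-in for encoded text tokens (assumption: a real system would use
# a trained encoder here).
text_features = make_vectors(12, DIM, rng)
# Stand-in for query vectors that would be learned during training.
learned_queries = make_vectors(NUM_QUERIES, DIM, rng)

knowledge_vectors = cross_attend(learned_queries, text_features)

# Generic soft-prefix stand-in for "integration": prepend the vectors to
# the token sequence. This is NOT the paper's fusion technique.
prefix_plus_tokens = knowledge_vectors + text_features

print(len(knowledge_vectors), len(knowledge_vectors[0]), len(prefix_plus_tokens))
# prints: 4 8 16
```

The point of the sketch is the shape contract: 12 input vectors go in, exactly `NUM_QUERIES` vectors come out, which is what makes such vectors convenient to hand to a language model as a fixed-size input.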



Source journal

Information Fusion (Engineering and Technology; Computer Science: Theory and Methods)

CiteScore: 33.20
Self-citation rate: 4.30%
Articles published: 161
Review time: 7.9 months

Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.