MinCache: A hybrid cache system for efficient chatbots with hierarchical embedding matching and LLM

Impact Factor 6.2 · JCR Q1, Computer Science, Theory & Methods · CAS Tier 2 (Computer Science) · Future Generation Computer Systems: The International Journal of eScience · Published: 2025-09-01 (online 2025-03-25) · DOI: 10.1016/j.future.2025.107822
Keihan Haqiq, Majid Vafaei Jahan, Saeede Anbaee Farimani, Seyed Mahmood Fattahi Masoom
Volume 170, Article 107822. Available at: https://www.sciencedirect.com/science/article/pii/S0167739X25001177
Citations: 0

Abstract

Large Language Models (LLMs) have emerged as powerful tools for various natural language processing tasks such as multi-agent chatbots, but their computational complexity and resource requirements pose significant challenges for real-time chatbot applications. Caching strategies can alleviate these challenges by reducing redundant computations and improving response times. In this paper, we propose MinCache, a novel hybrid caching system tailored for LLM applications. Our system employs a hierarchical cache strategy for string retrieval, performing exact match lookups first, followed by resemblance matching, and finally resorting to semantic matching to deliver the most relevant information. MinCache combines the strengths of Least Recently Used (LRU) caching and string fingerprinting, leveraging the MinHash algorithm for fast resemblance matching. Additionally, MinCache leverages a sentence transformer to estimate the semantics of input prompts. By integrating these approaches, MinCache delivers high cache hit rates, faster response delivery, and improved scalability for LLM applications across diverse domains. Our experiments demonstrate a significant acceleration of LLM applications, by up to 4.5X over GPTCache, as well as improvements in cache hit accuracy. We also discuss the scalability of our proposed approach in medical-domain chat services.
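The tiered lookup described in the abstract can be sketched in a few dozen lines. The sketch below is not the authors' implementation: it is a minimal, self-contained illustration of the first two tiers (exact LRU lookup, then MinHash-based resemblance matching over character shingles), with the third tier (semantic embedding matching via a sentence transformer) left as a stub. All class names, the shingle size `k=3`, the 64-permutation signature, and the 0.7 similarity threshold are illustrative assumptions, not values from the paper.

```python
import hashlib
import random
from collections import OrderedDict


def shingles(text, k=3):
    """Character k-grams of a whitespace-normalized, lowercased prompt."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}


class MinHash:
    """Minimal MinHash: num_perm salted hash functions over a shingle set."""

    def __init__(self, num_perm=64, seed=42):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(num_perm)]

    def signature(self, items):
        # For each salt, keep the minimum hash value over all shingles.
        sig = []
        for salt in self.salts:
            best = min(
                int.from_bytes(
                    hashlib.blake2b(f"{salt}:{it}".encode(),
                                    digest_size=8).digest(), "big")
                for it in items
            )
            sig.append(best)
        return tuple(sig)

    @staticmethod
    def similarity(sig_a, sig_b):
        """Fraction of matching slots estimates the Jaccard similarity."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


class HierarchicalCache:
    """Tier 1: exact LRU lookup. Tier 2: MinHash resemblance fallback."""

    def __init__(self, capacity=128, threshold=0.7):
        self.capacity = capacity
        self.threshold = threshold
        self.store = OrderedDict()  # prompt -> (signature, response)
        self.mh = MinHash()

    def get(self, prompt):
        # Tier 1: exact match with LRU recency update.
        if prompt in self.store:
            self.store.move_to_end(prompt)
            return self.store[prompt][1]
        # Tier 2: resemblance match via MinHash signatures.
        sig = self.mh.signature(shingles(prompt))
        best_key, best_sim = None, 0.0
        for key, (cached_sig, _) in self.store.items():
            sim = MinHash.similarity(sig, cached_sig)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is not None and best_sim >= self.threshold:
            return self.store[best_key][1]
        # Tier 3 (semantic embedding match) would go here; a production
        # system would query a sentence-transformer index before missing.
        return None

    def put(self, prompt, response):
        self.store[prompt] = (self.mh.signature(shingles(prompt)), response)
        self.store.move_to_end(prompt)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
```

A linear scan over cached signatures is used here for clarity; at scale, resemblance lookup is typically done with MinHash-LSH banding so that only candidate buckets are probed.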
Journal metrics: CiteScore 19.90 · self-citation rate 2.70% · 376 articles per year · average review time 10.6 months
Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.
Latest articles in this journal:
Blockchain architectures for enhancing EV infrastructure security: A unified framework for addressing sophisticated cyber-attacks
Applying quantum error-correcting codes for fault-tolerant blind quantum cloud computation
A swarm intelligence enabled multi-agent reinforcement learning scheme for computational task offloading in internet of things blockchain
KnowAIDE: A fAIR-compliant data environment to accelerate AI research
Non-intrusive kernel-level dispatching for MQTT shared subscriptions