Large language models (LLMs) are widely used in real-time interface systems that process user commands. Despite their high output quality, their long response times and substantial operating costs undermine the practicality and sustainability of LLM-based services. Prompt caching is one of the optimization techniques introduced to mitigate the problem: it avoids redundant processing of repetitive prompts by caching the response to a prompt and reusing it for the same or similar prompts. However, such a static caching scheme has an intrinsic limitation on the reusability of cached results, because real-world usage produces many differently worded prompts with the same semantics. In this paper, we introduce a new prompt caching strategy, Snippet Caching, for LLM-based command-driven IoT systems to overcome this limitation. It treats a command (prompt) as a function call with specific arguments. Instead of caching (input, output) pairs, it caches two simple code snippets per function that mimic the LLM's operations. Based on this strategy, we design a novel prompt caching scheme, Snip-Cache, which generates the code snippets with the help of LLMs. Experimental results show that Snip-Cache is significantly more beneficial to command-driven IoT systems than semantic caching schemes (GPTCache and vCache) in terms of response accuracy, response time, and token usage.
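To make the strategy concrete, the following is a minimal illustrative sketch (not the paper's implementation) of treating a command as a function call with arguments: a cache maps each function to an argument-extractor snippet and a response-generator snippet, and a cache hit answers the command without invoking the LLM. All names, the regex-based extractor, and the example "set temperature" function are assumptions for illustration; in Snip-Cache the snippets themselves would be produced by an LLM.

```python
import re

# Hypothetical cache: function name -> (argument extractor, response generator).
# These two callables stand in for the two LLM-mimicking code snippets
# described in the abstract.
snippet_cache = {}

def learn_snippet(name, pattern, respond):
    """Register an (argument-extractor, response-generator) pair for a function.
    In Snip-Cache, such snippets would be generated with the help of an LLM."""
    snippet_cache[name] = (re.compile(pattern), respond)

def handle_command(command):
    """Try cached snippets first; on a miss, the prompt would go to the LLM."""
    for extractor, respond in snippet_cache.values():
        m = extractor.match(command)
        if m:
            # Cache hit: the command is a known function call; run the
            # cached snippets instead of querying the LLM.
            return respond(**m.groupdict())
    return None  # cache miss: fall back to the (slow, costly) LLM call

# Example IoT function with a paraphrase-tolerant argument extractor,
# so differently worded commands with the same semantics still hit the cache.
learn_snippet(
    "set_temperature",
    r"(?:set|change|adjust) .*temperature .*?(?P<value>\d+)",
    lambda value: f"Thermostat set to {value} degrees.",
)
```

The point of the sketch is the shift in what is cached: not a fixed (prompt, response) pair, but per-function logic that generalizes across rewordings and argument values.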
