RAPID: Retrieval and Predictability for Improved Stable Diffusion
Authors: Jingyi Ping; Zhongxing Ming; Laizhong Cui
DOI: 10.1109/TCCN.2025.3528895
Journal: IEEE Transactions on Cognitive Communications and Networking, vol. 11, no. 2, pp. 1091-1102 (JCR Q1, Telecommunications; Region 1, Computer Science)
Publication date: 2025-01-13 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10838596/
Citations: 0
Abstract
Latent Diffusion Models (LDM) have emerged as a prominent approach within the broader field of generative AI, particularly for consumer-level image generation tasks. These models enable efficient inference of Diffusion Models (DM) by leveraging latent space representations, reducing computational requirements while preserving output quality and flexibility. Advanced sampling algorithms further improve inference speed and quality, enabling large-scale, low-latency image generation services. However, image generation inference remains time-consuming, and no specialized scheduling system exists for large-scale image generation models that ensures high resource utilization and latency guarantees. To address this, we introduce a two-stage method of saving intermediate samples, which bypasses the initial sampling steps and shortens image generation time. To provide predictable, high-utilization services for large-scale image generation requests, we conduct an in-depth analysis of the LDM structure and find that the response computation time is highly predictable. We further propose RAPID, an online acceleration scheduling framework designed for LDM-based networking request services. RAPID effectively reduces latency and optimizes load balancing across heterogeneous GPUs through precise computation scheduling tailored to specific GPUs. Extensive experiments indicate that RAPID achieves a ~37% increase in inference speed in multi-GPU high-concurrency environments.
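The two ideas the abstract combines — skipping initial denoising steps via cached intermediate samples, and assigning requests to heterogeneous GPUs using predictable per-step latency — can be illustrated with a minimal greedy scheduler. This is a sketch of the general technique, not the paper's actual implementation: the GPU names, per-step times, cache contents, and the linear latency model are all illustrative assumptions.

```python
import heapq

# Hypothetical per-GPU denoising-step times in seconds (illustrative values,
# not measurements from the paper).
GPU_STEP_TIME = {"A100": 0.020, "V100": 0.045, "T4": 0.110}

TOTAL_STEPS = 50  # a typical LDM sampler step count


def predicted_latency(remaining_steps, step_time):
    # The abstract observes that LDM response time is highly predictable;
    # here it is modeled as linear in the number of remaining sampling steps.
    return remaining_steps * step_time


def schedule(requests, cache):
    """Greedily assign each request to the GPU with the earliest predicted
    finish time. A cached intermediate sample lets a request skip its
    initial denoising steps, shrinking its predicted computation time."""
    # Min-heap of (predicted time at which the GPU becomes free, GPU name).
    heap = [(0.0, name) for name in GPU_STEP_TIME]
    heapq.heapify(heap)
    plan = []
    for req in requests:
        skipped = cache.get(req["prompt_key"], 0)  # steps skipped via cache
        remaining = TOTAL_STEPS - skipped
        free_at, gpu = heapq.heappop(heap)
        finish = free_at + predicted_latency(remaining, GPU_STEP_TIME[gpu])
        plan.append({"request": req["prompt_key"], "gpu": gpu, "finish": finish})
        heapq.heappush(heap, (finish, gpu))
    return plan


requests = [{"prompt_key": f"p{i}"} for i in range(4)]
cache = {"p0": 20, "p2": 20}  # p0 and p2 reuse a saved intermediate sample
for entry in schedule(requests, cache):
    print(entry)
```

Requests that hit the cache need fewer steps, so the scheduler can pack them onto slower GPUs or fit more of them per GPU; the predictable latency model is what makes the greedy earliest-finish assignment meaningful.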
Journal introduction:
The IEEE Transactions on Cognitive Communications and Networking (TCCN) aims to publish high-quality manuscripts that push the boundaries of cognitive communications and networking research. Cognitive, in this context, refers to the application of perception, learning, reasoning, memory, and adaptive approaches in communication system design. The transactions welcome submissions that explore various aspects of cognitive communications and networks, focusing on innovative and holistic approaches to complex system design. Key topics covered include architecture, protocols, cross-layer design, and cognition cycle design for cognitive networks. Additionally, research on machine learning, artificial intelligence, end-to-end and distributed intelligence, software-defined networking, cognitive radios, spectrum sharing, and security and privacy issues in cognitive networks are of interest. The publication also encourages papers addressing novel services and applications enabled by these cognitive concepts.