Pub Date : 2024-03-28DOI: 10.1109/LCA.2024.3406038
Andreas Kosmas Kakolyris;Dimosthenis Masouros;Sotirios Xydis;Dimitrios Soudris
The increasing popularity of LLM-based chatbots combined with their reliance on power-hungry GPU infrastructure forms a critical challenge for providers: minimizing energy consumption under Service-Level Objectives (SLOs) that ensure optimal user experience. Traditional energy optimization methods fall short for LLM inference due to their autoregressive architecture, which renders them incapable of meeting a predefined SLO without energy overprovisioning