While Large AI Models (LAMs) offer transformative intelligence, their deployment in real-time Internet-of-Things (IoT) applications is constrained by the heterogeneity of edge networks and high cloud latency. To enable cooperative LAM deployment across diverse edge hardware, we propose a modular, hardware-aware framework built on two key innovations. First, a heterogeneity-aware training framework combines Quantization-Aware Training (QAT) with parameter-efficient federated fine-tuning, reducing communication overhead by over 98% and allowing devices with different numerical precisions (FP32 vs. INT8) to collaboratively refine a single model at near-native accuracy (84.9% vs. 85.5%). Second, a precision-aware inference architecture virtualizes LAM features, including Chain-of-Thought (CoT) steps and Mixture-of-Experts (MoE) layers, as multi-precision microservices; a dynamic orchestrator then selects the optimal microservice for each task, balancing energy, latency, and accuracy. Compared with existing edge deployments, our framework reduces active memory usage by over 70% and end-to-end inference latency by nearly 60%. It thus provides a scalable, privacy-preserving solution for hardware-aware LAM intelligence on heterogeneous edge networks, effectively overcoming key deployment challenges.
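To make the first innovation concrete, the following Python sketch illustrates one plausible reading of parameter-efficient federated fine-tuning under mixed precision: each device trains only a small adapter on a frozen backbone, an INT8 device trains against a fake-quantized view of its adapter (the standard QAT emulation), and the server averages only the adapters. This is an illustration under stated assumptions, not the paper's implementation; all names, shapes, and hyperparameters (fake_quant_int8, local_step, the 4x4 adapter, lr=0.2) are hypothetical.

    import numpy as np

    def fake_quant_int8(w):
        # Simulate an INT8 round-trip so training "sees" quantization error (QAT).
        scale = np.abs(w).max() / 127.0 + 1e-12
        return np.round(w / scale).clip(-128, 127) * scale

    def local_step(adapter, grad, lr=0.2, int8=False):
        # One local update; an INT8 device evaluates its gradient at the quantized point.
        w = fake_quant_int8(adapter) if int8 else adapter
        return adapter - lr * grad(w)

    def fedavg(adapters):
        # The server aggregates only the tiny adapters, never the frozen LAM backbone;
        # shipping adapters instead of full weights is what makes the >98%
        # communication saving plausible.
        return np.mean(adapters, axis=0)

    def make_grad(target):
        # Toy local objective 0.5 * ||w - target||^2 standing in for each device's data.
        return lambda w: w - target

    rng = np.random.default_rng(0)
    targets = [rng.normal(size=(4, 4)) for _ in range(3)]  # heterogeneous local signals
    grads = [make_grad(t) for t in targets]
    global_adapter = np.zeros((4, 4))

    for _ in range(50):
        # Two FP32 devices and one INT8 device refine the same shared adapter.
        updates = [local_step(global_adapter, g, int8=(i == 2))
                   for i, g in enumerate(grads)]
        global_adapter = fedavg(updates)

    # The shared adapter converges near the mean of the local targets.
    print(round(float(np.abs(global_adapter - np.mean(targets, axis=0)).mean()), 4))

The point of the sketch is the communication pattern: only the adapter (here, 16 parameters) crosses the network each round, while FP32 and INT8 participants contribute to one shared model.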
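For the second innovation, a minimal sketch of what a dynamic, precision-aware orchestrator could look like is given below: each CoT step or MoE layer is exposed as several precision variants, and the orchestrator scores candidates on profiled energy, latency, and accuracy under an optional latency budget. Again, this is an assumed design, not the paper's code; Microservice, select_microservice, the weights, and the profiled numbers are illustrative (the accuracy values echo the 85.5% vs. 84.9% figures above).

    from dataclasses import dataclass

    @dataclass
    class Microservice:
        name: str          # e.g. "cot-step@int8"
        precision: str     # "fp32" or "int8"
        latency_ms: float  # profiled per-request latency
        energy_mj: float   # profiled per-request energy
        accuracy: float    # validation accuracy of this variant

    def select_microservice(candidates, w_lat=0.4, w_energy=0.3, w_acc=0.3,
                            latency_budget_ms=None):
        # Pick the variant with the best weighted energy/latency/accuracy score.
        if latency_budget_ms is not None:
            candidates = [c for c in candidates if c.latency_ms <= latency_budget_ms]
        if not candidates:
            raise RuntimeError("no variant satisfies the latency budget")
        max_lat = max(c.latency_ms for c in candidates)
        max_en = max(c.energy_mj for c in candidates)

        def score(c):
            # Higher accuracy is better; latency and energy are normalized penalties.
            return (w_acc * c.accuracy
                    - w_lat * c.latency_ms / max_lat
                    - w_energy * c.energy_mj / max_en)

        return max(candidates, key=score)

    variants = [
        Microservice("cot-step@fp32", "fp32", latency_ms=120.0, energy_mj=900.0, accuracy=0.855),
        Microservice("cot-step@int8", "int8", latency_ms=45.0, energy_mj=310.0, accuracy=0.849),
    ]
    best = select_microservice(variants, latency_budget_ms=100.0)
    print(best.name)  # under this budget only the INT8 variant qualifies

The design choice the sketch highlights is per-task selection: a tight latency budget routes a request to the INT8 variant at a small accuracy cost, while a relaxed budget lets the scorer prefer the FP32 variant.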
