Graph Neural Networks (GNNs) have become core models for learning from relational data in domains such as transportation, social networks, and recommender systems. However, distributed GNN training on large graphs suffers from severe GPU workload imbalance and high communication cost, caused by dynamic mini-batch sampling and large structural differences among nodes. To address these challenges, we propose AFS-GNN, a scheduling-aware adaptive framework that achieves fine-grained workload balancing in distributed GNN training. AFS-GNN continuously monitors per-GPU mini-batch execution time through lightweight runtime agents and employs Kalman filtering to suppress transient fluctuations and detect persistent imbalance trends. Upon detecting imbalance, it constructs a Hierarchical Dependency Graph (HDG) that explicitly captures multi-hop aggregation dependencies and node-level computational costs. Guided by a heuristic load estimator, AFS-GNN applies cost-aware spectral bipartitioning via the Fiedler vector to select structurally coherent migration blocks that minimize inter-GPU communication while preserving computational consistency. Selected blocks are migrated asynchronously across devices using intra-node or inter-process communication, ensuring non-blocking execution. Extensive experiments on two large-scale benchmarks, ogbn-products and ogbn-papers100M, demonstrate that AFS-GNN achieves up to 21.7% acceleration over Euler, 15% over DistDGL, and 13.7% over FlexGraph, while maintaining stable convergence and scalability across diverse batch sizes and partition configurations.
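To illustrate the monitoring step, the following is a minimal sketch of a scalar Kalman filter that smooths noisy per-GPU mini-batch times, damping one-off spikes while tracking a persistent load shift. The class and parameter names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's exact filter): a scalar Kalman filter
# that tracks the steady-state mini-batch time of one GPU.
class BatchTimeFilter:
    def __init__(self, init_est, process_var=1e-4, meas_var=1e-2):
        self.x = init_est      # current estimate of the batch time
        self.p = 1.0           # estimate uncertainty
        self.q = process_var   # process noise: slow drift in the true load
        self.r = meas_var      # measurement noise: transient jitter

    def update(self, measured):
        self.p += self.q                   # predict: uncertainty grows
        k = self.p / (self.p + self.r)     # Kalman gain
        self.x += k * (measured - self.x)  # correct toward the measurement
        self.p *= (1.0 - k)                # shrink uncertainty
        return self.x
```

With these settings a single outlier batch moves the estimate only slightly, whereas a sustained change in measured times pulls the estimate toward the new level over a few dozen batches, which is the "persistent imbalance trend" the runtime agents look for.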
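The block-selection step rests on a standard construction: the Fiedler vector (the eigenvector of the graph Laplacian's second-smallest eigenvalue), whose sign pattern yields a bipartition with few cut edges. A dependency-free sketch using power iteration, assuming unit node costs (the paper's cost-aware weighting is not reproduced here):

```python
# Sketch of sign-based spectral bipartition via the Fiedler vector.
# Uses power iteration on (cI - L) restricted to the complement of the
# all-ones eigenvector; pure Python, unweighted graph for illustration.
def fiedler_bipartition(n, edges, iters=500):
    # Build the Laplacian L = D - A as nested lists.
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1.0
        L[v][v] += 1.0
        L[u][v] -= 1.0
        L[v][u] -= 1.0
    # Shift so that M = cI - L is positive semidefinite.
    c = 2.0 * max(L[i][i] for i in range(n))
    M = [[(c if i == j else 0.0) - L[i][j] for j in range(n)]
         for i in range(n)]
    # Deterministic pseudo-random start vector.
    x = [((i * 2654435761) % 97) / 97.0 for i in range(n)]
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # project out the all-ones eigenvector
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # The sign of each Fiedler-vector entry assigns its node to a side.
    return [0 if xi < 0 else 1 for xi in x]
```

On a "barbell" of two triangles joined by one edge, the sign split recovers the two triangles, cutting only the bridge edge, which is the structurally coherent, low-communication block boundary the abstract refers to.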
